Category Archives: Data Science

130 Data Science Terms Every Data Scientist Should Know | by Anjolaoluwa Ajayi | . | Jan, 2024 – Medium

So let's start right away, shall we?

1. A/B Testing: A statistical method used to compare two versions of a product, webpage, or model to determine which performs better.

2. Accuracy: The measure of how often a classification model correctly predicts outcomes among all instances it evaluates.

3. Adaboost: An ensemble learning algorithm that combines weak classifiers to create a strong classifier.

4. Algorithm: A step-by-step set of instructions or rules followed by a computer to solve a problem or perform a task.

5. Analytics: The process of interpreting and examining data to extract meaningful insights.

6. Anomaly Detection: Identifying unusual patterns or outliers in data.

7. ANOVA (Analysis of Variance): A statistical method used to analyze the differences among group means in a sample.

8. API (Application Programming Interface): A set of rules that allows one software application to interact with another.

9. AUC-ROC (Area Under the ROC Curve): A metric that tells us how well a classification model is doing overall, considering different ways of deciding what counts as a positive or negative prediction.

10. Batch Gradient Descent: An optimization algorithm that updates model parameters using the entire training dataset (unlike mini-batch gradient descent, which uses small subsets of the data).

11. Bayesian Statistics: A statistical approach that combines prior knowledge with observed data.

12. BI (Business Intelligence): Technologies, processes, and tools that help organizations make informed business decisions.

13. Bias: An error in a model that causes it to consistently predict values away from the true values.

14. Bias-Variance Tradeoff: The balance between the error introduced by bias and variance in a model.

15. Big Data: Large and complex datasets that cannot be easily processed using traditional data processing methods.

16. Binary Classification: Categorizing data into two groups, such as spam or not spam.

17. Bootstrap Sampling: A resampling technique where random samples are drawn with replacement from a dataset.
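For illustration, a minimal NumPy sketch of bootstrap resampling, here used to build a rough confidence interval for a mean (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=200)  # synthetic sample

# Draw 1,000 bootstrap samples (with replacement) and record each sample's mean
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]

# Percentile-based 95% confidence interval for the mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```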

18. Categorical Data: Variables that represent categories or groups and can take on a limited, fixed number of distinct values.

19. Chi-Square Test: A statistical test used to determine if there is a significant association between two categorical variables.

20. Classification: Categorizing data points into predefined classes or groups.

21. Clustering: Grouping similar data points together based on certain criteria.

22. Confidence Interval: A range of values used to estimate the true value of a parameter with a certain level of confidence.

23. Confusion Matrix: A table used to evaluate the performance of a classification algorithm.
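For illustration, a tiny scikit-learn example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions from a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))
```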

24. Correlation: A statistical measure that describes the degree of association between two variables.

25. Covariance: A measure of how much two random variables change together.

26. Cross-Entropy Loss: A loss function commonly used in classification problems.

27. Cross-Validation: A technique to assess the performance of a model by splitting the data into multiple subsets for training and testing.
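As a quick sketch, 5-fold cross-validation with scikit-learn on a built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Train on 4 folds, evaluate on the held-out fold, repeated 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```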

28. Data Cleaning: The process of identifying and correcting errors or inconsistencies in datasets.

29. Data Mining: Extracting valuable patterns or information from large datasets.

30. Data Preprocessing: Cleaning and transforming raw data into a format suitable for analysis.

31. Data Visualization: Presenting data in graphical or visual formats to aid understanding.

32. Decision Boundary: The dividing line that separates different classes in a classification problem.

33. Decision Tree: A tree-like model that makes decisions based on a set of rules.

34. Dimensionality Reduction: Reducing the number of features in a dataset while retaining important information.

35. Eigenvalue and Eigenvector: Concepts used in linear algebra, often employed in dimensionality reduction to transform and simplify complex datasets.

36. Elastic Net: A regularization technique that combines L1 and L2 penalties.

37. Ensemble Learning: Combining multiple models to improve overall performance and accuracy.

38. Exploratory Data Analysis (EDA): Analyzing and visualizing data to understand its characteristics and relationships.

39. F1 Score: A metric that combines precision and recall in classification models.
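For illustration, precision, recall, and F1 computed with scikit-learn on made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# F1 is the harmonic mean of precision and recall
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```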

40. False Positive and False Negative: Incorrect predictions in binary classification.

41. Feature: A data column used as input for ML models to make predictions.

42. Feature Engineering: Creating new features from existing ones to improve model performance.

43. Feature Extraction: Reducing the dimensionality of data by selecting important features.

44. Feature Importance: Assessing the contribution of each feature to the model's predictions.

45. Feature Selection: Choosing the most relevant features for a model.

46. Gaussian Distribution: A type of probability distribution often used in statistical modeling.

47. Geospatial Analysis: Analyzing and interpreting patterns and relationships within geographic data.

48. Gradient Boosting: An ensemble learning technique where weak models are trained sequentially, each correcting the errors of the previous one.

49. Gradient Descent: An optimization algorithm used to minimize the error in a model by adjusting its parameters.

50. Grid Search: A method for tuning hyperparameters by evaluating models at all possible combinations.
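A minimal scikit-learn sketch; the grid values here are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination of the listed hyperparameter values, scored by cross-validation
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```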

51. Heteroscedasticity: Unequal variability of errors in a regression model.

52. Hierarchical Clustering: A method of cluster analysis that organizes data into a tree-like structure of clusters, where each level of the tree shows the relationships and similarities between different groups of data points.

53. Hyperparameter: A parameter whose value is set before the training process begins.

54. Hypothesis Testing: A statistical method to test a hypothesis about a population parameter based on sample data.

55. Imputation: Filling in missing values in a dataset using various techniques.

56. Inferential Statistics: A branch of statistics that involves making inferences about a population based on a sample of data.

57. Information Gain: A measure used in decision trees to assess the effectiveness of a feature in classifying data.

58. Interquartile Range (IQR): A measure of statistical dispersion, representing the range between the first and third quartiles.

59. Joint Plot: A type of data visualization in Seaborn used for exploring relationships between two variables and their individual distributions.

60. Joint Probability: The probability of two or more events happening at the same time, often used in statistical analysis.

61. Jupyter Notebook: An open-source web application for creating and sharing documents containing live code, equations, visualizations, and narrative text.

62. K-Means Clustering: A popular algorithm for partitioning a dataset into distinct, non-overlapping subsets.

63. K-Nearest Neighbors (KNN): A simple and widely used classification algorithm based on how close a new data point is to other data points.

64. L1 Regularization (Lasso): Adding the absolute values of coefficients as a penalty term to the loss function.

65. L2 Regularization (Ridge): Adding the squared values of coefficients as a penalty term to the loss function.

66. Linear Regression: A statistical method for modeling the relationship between a dependent variable and one or more independent variables.

67. Log Likelihood: The logarithm of the likelihood function, often used in maximum likelihood estimation.

68. Logistic Function: A sigmoid function used in logistic regression to model the probability of a binary outcome.

69. Logistic Regression: A statistical method for predicting the probability of a binary outcome.

70. Machine Learning: A subset of artificial intelligence that enables systems to learn and make predictions from data.

71. Mean Absolute Error (MAE): A measure of the average absolute differences between predicted and actual values.

72. Mean Squared Error (MSE): A measure of the average squared difference between predicted and actual values.

73. Mean: The average value of a set of numbers.

74. Median: The middle value in a set of sorted numbers.

75. Metrics: Criteria used to assess the performance of a machine learning model, such as accuracy, precision, recall, and F1 score.

76. Model Evaluation: Assessing the performance of a machine learning model using various metrics.

77. Multicollinearity: The presence of a high correlation between independent variables in a regression model.

78. Multi-Label Classification: Assigning multiple labels to an input, as opposed to just one.

79. Multivariate Analysis: Analyzing data with multiple variables to understand relationships between them.

80. Naive Bayes: A probabilistic algorithm, based on Bayes' theorem, used for classification.

81. Normalization: Scaling numerical variables to a standard range.

82. Null Hypothesis: A statistical hypothesis that assumes there is no significant difference between observed and expected results.

83. One-Hot Encoding: A technique to convert categorical variables into a binary matrix for machine learning models.
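For illustration, one-hot encoding a toy column with pandas:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 column
print(pd.get_dummies(df, columns=["color"]))
```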

84. Ordinal Variable: A categorical variable with a meaningful order but not necessarily equal intervals.

85. Outlier: An observation that deviates significantly from other observations in a dataset.

86. Overfitting: A model that performs well on the training data but poorly on new, unseen data.

87. Pandas: A standard Python library for data manipulation and for working with structured data.

88. Pearson Correlation Coefficient: A measure of the linear relationship between two variables.

89. Poisson Distribution: A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space.

90. Precision: The ratio of true positive predictions to the total number of positive predictions made by a classification model.

91. Predictive Analytics: Using data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes.

92. Principal Component Analysis (PCA): A dimensionality reduction technique that transforms data into a new set of features, simplifying the information while preserving its fundamental patterns.
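A quick scikit-learn sketch, projecting the 4-feature iris data onto 2 principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```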

93. Principal Component: The axis that captures the most variance in a dataset in principal component analysis.

94. P-value: The probability of obtaining a result as extreme as, or more extreme than, the observed result during hypothesis testing.

95. Q-Q Plot (Quantile-Quantile Plot): A graphical tool to assess if a dataset follows a particular theoretical distribution.

96. Quantile: A data point or set of data points that divide a dataset into equal parts.

97. Random Forest: An ensemble learning method that constructs a multitude of decision trees and merges them together for more accurate and stable predictions.

98. Random Sample: A sample where each member of the population has an equal chance of being selected.

99. Random Variable: A variable whose possible values are outcomes of a random phenomenon.

See the original post here:

130 Data Science Terms Every Data Scientist Should Know | by Anjolaoluwa Ajayi | . | Jan, 2024 - Medium

Philosophy and Data Science: Thinking Deeply about Data | by Jarom Hulet | Jan, 2024 – Towards Data Science

Part 3: Causality

Image by Cottonbro Studios from Pexels.com

My hope is that by the end of this article you will have a good understanding of how philosophical thinking around causation applies to your work as a data scientist. Ideally you will have a deeper philosophical perspective to give context to your work!

This is the third part in a multi-part series about philosophy and data science. Part 1 covers how the theory of determinism connects with data science and part 2 is about how the philosophical field of epistemology can help you think critically as a data scientist.

Introduction

I love how many philosophical topics take a seemingly obvious concept, like causality, and make you realize it is not as simple as you think. For example, without looking up a definition, try to define causality off the top of your head. That is a difficult task for me at least! This exercise hopefully nudged you to realize that causality isn't as black and white as you may have thought.

Here is what this article will cover:

Causality's Unobservability

David Hume, a famous skeptic and one of my favorite philosophers, made the astute observation that we cannot observe causality directly with our senses. Here's a classic example: we can see a baseball flying towards the window and we can see the window break, but we cannot see the causality directly. We cannot

More here:

Philosophy and Data Science: Thinking Deeply about Data | by Jarom Hulet | Jan, 2024 - Towards Data Science

2024: The Year of the Value-Driven Data Person | by Mikkel Dengsøe | Jan, 2024 – Towards Data Science

It's been a whirlwind if you've worked in tech over the past few years.

VC funding declined by 72% from 2022 to 2023

New IPOs fell by 82% from 2021 to 2022

More than 150,000 tech workers were laid off in the US in 2023

During the heyday until 2021, funding was easy to come by, and teams couldn't grow fast enough. In 2022, growth at all costs was replaced with profitability goals. Budgets were no longer allocated based on finger-in-the-air goals but were heavily scrutinized by the CFO.

Data teams were not isolated from this. A 2023 survey by dbt found that 28% of data teams planned on reducing headcount.

Looking at the number of data roles in selected scale-ups, compared to the start of last year, more have reduced headcount than have expanded.

Data teams now find themselves at a chasm.

On one hand, the ROI of data teams has historically been difficult to measure. On the other hand, AI is having its moment (according to a survey by MIT Technology Review, 81% of executives believe that AI will be a significant competitive advantage for their business). AI & ML projects often have clearer ROI cases, and data teams are at the center of this, with an increasing number of machine learning systems being powered by the data warehouse.

So, what are data people to do in 2024?

Below, I've looked into five steps you can take to make sure you're well-positioned and aligned to business value if you work in a data role.

People like it when they get to share their opinions about you. It makes them feel listened to and gives you a chance to learn about your weak spots. You should lean into this and proactively ask your important stakeholders for feedback.

While you may not want to survey everyone in the company, you can create a group of people most reliant on data, such as everyone in a senior role. Ask them to give candid, anonymous feedback on questions such as happiness about self-serve, the quality of their dashboards, and if there are enough data people in their area (this also gives you some ammunition before asking for headcount).

End with the question, "If you had a magic wand, what would you change?" to allow them to come up with open-ended suggestions.

Survey results: data about data teams' data work. It doesn't get better

Be transparent with the survey results and play them back for the stakeholders with a clear action plan for addressing areas that need improvement. If you run the survey every six months and put your money where your mouth is, you can come back and show meaningful improvements. Make sure to collect data about which business area the respondents work in. This will give you some invaluable insights into where you've got blind spots and whether there are specific pain points in business areas you didn't know about.

You can sit back and wait for stakeholder requests to come to you. But if you're like most data people, you want to have a say in what projects you work on and may even have some ideas yourself.

Back in my days as a Data Analyst at Google, one of the business unit directors shared a wise piece of advice: "If you want me to buy into your project, present it to me as if you were a founder raising capital for your startup." This may sound like Silicon Valley talk, but he had some valid points when I dug into it.

Example: ML model business case proposal summary

Business case proposals like the one above are presented to a few of the senior stakeholders in your area to get buy-in that you should spend your time here instead of on one of the hundreds of other things you could be doing. It gives them a transparent forum for becoming part of the project and being brought in from the get-go, and also a way to shoot down projects early where the opportunity is too small or the risk too big.

Projects such as a new ML model or a new project to create operational efficiencies are particularly well suited for this. But even if youre asked to revamp a set of data models or build a new company-wide KPI dashboard, applying some of the same principles can make sense.

When you think about cost, it's easy to end up where you can't see the forest for the trees. For example, it may sound impressive that a data analyst can shave off $5,000/month by optimizing some of the longest-running queries in dbt. But while these achievements shouldn't be ignored, a more holistic approach to cost savings is helpful.

Start by asking yourself what all the costs of the data team consist of and what the implications of this are.

If you take a typical mid-sized data team in a scaleup, it's not uncommon to see the three largest cost drivers be disproportionately allocated as:

This is not to say that you should immediately be focusing on headcount reduction, but if your cost distribution looks anything like the above, ask yourself questions like:

Should we have 2x FTEs build this in-house tool, or could we buy it instead?

Are there low-value initiatives where an expensive head count is tied up?

Are two weeks of work for a $5,000 cost-saving a good return on investment?

Are there optimizations in the development workflow, such as the speed of CI/CD checks, that could be improved to free up time?

I've seen teams get bogged down by having tens of thousands of dbt tests across thousands of data models. It's hard to know which ones are important, and developing new data models takes twice as long because everything is scrutinized through the same lens.

On the other hand, teams who barely test their data pipelines and build data models that don't follow solid data modeling principles too often find themselves slowed down, having to spend twice as much time cleaning up and fixing issues retrospectively.

The value-driven data person carefully balances speed and quality through

They also know that to be successful, their company needs to operate more like a speedboat and less like a tanker: taking quick turns as you learn through experiments what works and what doesn't, reviewing progress every other week, and giving autonomy to each team to set their direction.

Data teams often operate under uncertainty (e.g., will this ML model work?). The faster you ship, the quicker you learn what works and what doesn't. The best data people are always careful to keep this in mind and know where on the curve they fall.

For example, if you're an ML engineer working on a model to decide which customers can sign up for a multi-billion dollar neobank, you can no longer get away with quick and dirty work. But if you're working in a seed-stage startup where the entire backend may be redone in a few months, you know to sometimes favor speed over quality.

People in data roles are often not the ones to shout the loudest about their achievements. While nobody wants to be a shameless self-promoter, there's a balance to strike.

If you've done work that had an impact, don't be afraid to let your colleagues know. It's even better if you have some numbers to back it up (who better to put numbers to the impact of data work than you?). When doing this, it's easy to get bogged down by implementation details of how hard it was to build, the fancy algorithm you used, or how many lines of code you wrote. But stakeholders care little about this. Instead, consider this framing.

Don't be afraid to call out early when things are not progressing as expected. For example, call it out if you're working on a project that is going nowhere or getting increasingly complex. You may fear that you put yourself at risk by doing so, but your stakeholders will perceive it as showing a high level of ownership and not falling for the sunk cost fallacy.

Follow this link:

2024: The Year of the Value-Driven Data Person | by Mikkel Dengsøe | Jan, 2024 - Towards Data Science

Bayesian Inference: A Unified Framework for Perception, Reasoning, and Decision-making – Towards Data Science

Photo by Janko Ferli on Unsplash

"the most important questions in life are indeed, for the most part, only problems in probability. One may even say, strictly speaking, that almost all of our knowledge is only probable."

Pierre-Simon Laplace, Philosophical Essay on Probabilities

Over 200 years ago, French mathematician Pierre-Simon Laplace recognized that most problems we face are inherently probabilistic and that most of our knowledge is based on probabilities rather than absolute certainties. With this premise, he fully developed Bayes' theorem, a fundamental theory of probability, without being aware that the English reverend Thomas Bayes (also a statistician and philosopher) had described the theorem sixty years earlier. The theorem, therefore, was named after Bayes, although Laplace did most of the mathematical work to complete it.

In contrast to its long history, Bayes' theorem has come into the spotlight only in recent decades, finding a prominent surge in its applications in diverse disciplines, with the growing realization that the theorem closely aligns with our perception and cognitive processes. It manifests the dynamic adjustment of probabilities informed by both new data and pre-existing knowledge. Moreover, it explains the iterative and evolving nature of our knowledge-acquiring and decision-making.
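In symbols, this updating follows the familiar statement of Bayes' theorem (standard notation, included here for reference rather than quoted from the article):

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
```

Here P(H) is the prior belief in a hypothesis, P(D | H) the likelihood of the observed data under that hypothesis, P(D) the overall probability of the data, and P(H | D) the updated (posterior) belief.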

In addition, Bayesian inference has become a powerful technique for building predictive models and making model selections, applied broadly in various fields in scientific research and data science. Using Bayesian statistics in deep learning is also a vibrant area under active study.

This article will first review the basics of Bayes theorem and its application in Bayesian inference and statistics. We will next explore how the Bayesian framework unifies our understanding of perception, human cognition, and decision-making. Ultimately, we will gain insights into the current state and challenges of Bayesian intelligence and the interplay between human and artificial intelligence in the near future.

Bayes' theorem begins with the mathematical notion of conditional probability, the probability

See original here:

Bayesian Inference: A Unified Framework for Perception, Reasoning, and Decision-making - Towards Data Science

Harnessing data science to turn information into investment insights – HedgeWeek

PARTNER CONTENT

Northern Trust's Investment Data Science capabilities offer a curated suite of solutions that help institutions digitise their investment process, enabling faster and smarter investment decisions. Paul Fahey, Head of Investment Data Science, chats to Hedgeweek about growth opportunities, the drivers of client demand, and current challenges facing the hedge fund industry.

Where do you see the most significant opportunities for growth in the coming year?

Technology has altered how managers operate and how they analyse their data, and there are opportunities for hedge fund managers to make better use of decision-support tools. Perhaps the greatest opportunity for growth is related to artificial intelligence (AI) and generative artificial intelligence (Gen AI), which has the potential to revolutionise how businesses operate.

Northern Trust continues to research this technology to make processes more efficient. The desire to incorporate Gen AI into business models will continue to grow due to its capacity to manage data, streamline content creation and workflow processes, enhance risk assessment and foster innovation. Successful execution will depend on the development of human skillsets and the quality and the quantity of the data underpinning the models.

Can you outline the most impactful drivers of client demand in the coming year?

Access to quality data and advanced technologies remain critical drivers of client demand. Now more than ever, firms are looking to harness the power of data science and analytics to turn information into meaningful insights. Northern Trust's Investment Data Science and its partnership with Equity Data Science (EDS) enable clients to optimise their investment process and deliver enhanced outcomes through scalable and repeatable decision-making. The power of these tools enables institutions to enrich their investment process, delivering faster and smarter investment decisions.

What have been the biggest drivers of growth within your business?

Several factors have contributed to the growth of Northern Trust's Investment Data Science business. Notably, interest in data analysis has increased as flexible, cost-effective, and powerful solutions that leverage modern technology have become commercially viable to deploy for improved decision-making and higher productivity. Managers are looking beyond merely collecting data and are now asking what they can do with the data to drive measurable outcomes and, ultimately, investment performance. Northern Trust offers tools to answer these questions. Another driver of growth has been the recent advancements in AI. These combined factors are shaping how investment decisions are made and portfolios are managed in a data-driven and technologically advanced environment.

Which are the most significant challenges in the hedge fund industry right now and how can they be best mitigated?

Challenges facing hedge fund managers today are similar to the ones they have always faced: how to generate alpha and distribute their product. To do this, they must become better at managing their data. Yet collecting and updating data can be extremely time-consuming and daunting for managers who are stretched to the limit. Changes to the competitive and regulatory environments have further impacted managers' research management processes, prompting a need for increased transparency and integration into investment decisions, unleashing the value of a manager's own data.

With the advent of state-of-the-art cloud-based platforms, many of these challenges have solutions. As investment teams seek to differentiate themselves, leveraging leading technology to improve decision-making is critical. Digitising the process enables the capture of decisions that were not made or taken, adding further intelligence to the manager's ability to generate alpha and attract clients.

What role can technology play in portfolio risk management?

Recent advancements in technology provide the tools and analytical capabilities to assess, monitor and mitigate risks associated with investment portfolios. Technology can help aggregate data inputs, simplify research management and idea generation, streamline risk modelling and backtesting, and create a transparent, digital environment to execute, measure, and refine the investment process. Integrating advanced technologies that provide real-time data analysis, help assess risk, and enable portfolio risk mitigation strategies contributes to the overall resiliency and stability of an investment portfolio.

Paul Fahey, Head of Investment Data Science, Northern Trust

Follow this link:

Harnessing data science to turn information into investment insights - HedgeWeek

The Future at the Intersection of AI, Machine Learning, and Data Science – Medriva

The Impact of AI, Machine Learning, and Data Science on the Future

Emerging technologies such as Artificial Intelligence (AI), machine learning, and data science are driving a seismic shift in various aspects of our society. From how we work, communicate, and even how we live, the impact of these technologies is far-reaching and transformative. As we continue to navigate the digital age, it is clear that AI, machine learning, and data science will play a pivotal role in shaping an enlightened future.

AI and machine learning are revolutionizing industries and shaping the future of work and society. From healthcare and finance to transportation, these technologies are reinventing how businesses operate. One of the key ways that AI and machine learning are transforming industries is by automating tasks and improving efficiency. This is particularly evident in the healthcare industry, where AI-driven predictive analytics are helping to improve patient outcomes and reduce costs.

AI and machine learning are not only changing how we work, but they are also creating new job opportunities. The rise of AI has led to the emergence of new roles such as AI specialists and data scientists. Furthermore, generative AI is reshaping the future of work by making it possible for anyone to learn new skills and knowledge quickly and easily. This opens up new opportunities for upskilling and reskilling, ensuring that workers can adapt to the changing job landscape.

As AI continues to advance, the quality of data used to train AI models becomes increasingly important. The accuracy, reliability, and trustworthiness of AI models are directly related to the quality of the data used in their training. Therefore, ensuring data quality is a crucial aspect of AI model development. This involves addressing challenges to data quality and implementing best data quality practices for AI projects.

While AI and machine learning offer numerous benefits, they also raise important ethical questions. As these technologies become more widespread, concerns about job displacement and privacy issues are growing. It is therefore crucial that as we continue to develop and implement AI and machine learning technologies, we also consider their ethical implications and work towards solutions that benefit all members of society.

As we look to the future, it is clear that AI, machine learning, and data science will play a central role in shaping our world. From revolutionizing industries and creating new job opportunities to raising important ethical questions, these technologies are at the forefront of societal change. As we continue to navigate this evolving landscape, it is crucial that we understand the potential of these technologies and harness them to create an enlightened future.

More:

The Future at the Intersection of AI, Machine Learning, and Data Science - Medriva

LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab – Towards Data Science

Experimenting with Large Language Models for free (Part 2)

Photo by Glib Albovsky, Unsplash

In the first part of the story, we used a free Google Colab instance to run a Mistral-7B model and extract information using the FAISS (Facebook AI Similarity Search) database. In this part, we will go further, and I will show how to run a LLaMA 2 13B model; we will also test some extra LangChain functionality like making chat-based applications and using agents. In the same way, as in the first part, all used components are based on open-source projects and will work completely for free.

Let's get into it!

LLaMA.CPP is a very interesting open-source project, originally designed to run a LLaMA model on MacBooks, but its functionality grew far beyond that. First, it is written in plain C/C++ without external dependencies and can run on any hardware (CUDA, OpenCL, and Apple silicon are supported; it can even work on a Raspberry Pi). Second, LLaMA.CPP can be connected with LangChain, which allows us to test a lot of its functionality for free without having an OpenAI key. Last but not least, because LLaMA.CPP works everywhere, it's a good candidate to run in a free Google Colab instance. As a reminder, Google provides free access to Python notebooks with 12 GB of RAM and 16 GB of VRAM, which can be opened using the Colab Research page. The code is opened in the web browser and runs in the cloud, so everybody can access it, even from a minimalistic budget PC.

Before using LLaMA, let's install the library. The installation itself is easy; we only need to enable LLAMA_CUBLAS before using pip:
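A typical Colab cell for this; the exact flags in the original post may differ slightly from the sketch below:

```python
# Build llama-cpp-python with cuBLAS (GPU) support in a Colab cell.
# CMAKE_ARGS/FORCE_CMAKE follow the project's documented build options.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
!pip install huggingface-hub langchain
```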

For the first test, I will be using a 7B model. Here, I also installed the huggingface-hub library, which allows us to automatically download a Llama-2 7B Chat model in the GGUF format needed for LLaMA.CPP. I also installed LangChain
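A minimal sketch of the download and LangChain wiring (not the article's exact code; the repo and file names below are illustrative choices of a GGUF conversion):

```python
from huggingface_hub import hf_hub_download
from langchain.llms import LlamaCpp  # newer versions expose this via langchain_community.llms

# Download one quantized GGUF file; repo_id and filename are illustrative
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)

# Wrap the local model so it can be used anywhere LangChain expects an LLM
llm = LlamaCpp(model_path=model_path, n_gpu_layers=-1, n_ctx=2048)
print(llm.invoke("Q: What is llama.cpp? A:"))
```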

The rest is here:

LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab - Towards Data Science

Where are the Vietnamese data science candidates? – VnExpress International

Students often submit numerous internship applications throughout the semester, aiming to secure positions that might result in full-time employment after the internship concludes. As someone who has experienced the internship hiring process from both perspectives, I have noticed that I have never interviewed a single Vietnamese candidate during my entire tenure at my current company.

First, internship candidates can be surprisingly more qualified than some of the full-time candidates we have interviewed and extended offers to. Faced with such highly qualified internship candidates, our team must carefully evaluate whether they would merit full-time offers down the road. A potential factor contributing to this trend is that companies intentionally target top-performing programs to source talent.

Despite efforts to source talent from diverse backgrounds, and my personal outreach to Vietnamese professional and student networks, I have noticed a striking absence of Vietnamese candidates reaching the final interview stage. My heart races whenever I spot a Vietnamese name among the shortlisted candidates. Over my career, among all Asian candidates, I have primarily encountered Indian and Chinese international students in interviews. As a Vietnamese professional working in data science, I could not help but ask: why don't Vietnamese students seek internships with my company the way other nationalities do? Could it be due to perceptions about the prestige or appeal of my company, an old life insurance company, versus large tech companies like Google, Meta or Microsoft? Yet many students from China and India still pursue opportunities with us despite those perceived differences. Our interns last year all hailed from India. They were some of the most talented data scientists that I have had a chance to work with.

According to the Open Doors report by the Institute of International Education, Vietnam ranks fifth among places of origin sending students to the United States. In the academic year 2022/2023, almost 22,000 Vietnamese students were studying in the United States, while China and India sent close to 290,000 and 270,000 students respectively. This translates to roughly 13 Chinese or Indian students for every Vietnamese student. In the professional field of data science, however, that ratio does not match my observation. Every year, out of the roughly 50 students selected for our interview rounds, not a single Vietnamese student makes it into the final round. So the numbers alone are not enough to explain the absence of Vietnamese candidates. I checked this observation with another Vietnamese data scientist at Quora, who confirmed that Vietnamese candidates are a rarity in our field despite the global data science boom since the early 2010s. Thomas Davenport and DJ Patil wrote for the Harvard Business Review and called it "The Sexiest Job of the 21st Century."

What is it about the Vietnamese higher education, or the way that Vietnamese students in Vietnam and in the U.S. study or choose their career paths that explains their absence in professional data science now?

My observation is also shared by researchers in social science who examine the issue of diversity in high-paying tech jobs. When I was in graduate school, I had a chance to attend a book talk by sociologist France Winddance Twine. The book, titled "Geek Girls," examines how women of color, such as Asian, Black, and Latina women, end up working in high-paying jobs in Silicon Valley. Her research found that most female engineers, especially Indian female engineers, possess something called "geek capital," a form of cultural capital that refers to the ability to relate to the knowledge, abilities, and cultural competencies of geek cultures or STEM fields. Most of the time, it means they are embedded in a social network that is directly plugged into the tech culture. Some Indian female engineers stated that the reason they became engineers or took engineering degrees in college was that they had parents or siblings who were engineers, which made it easier for them to follow the same path. After the talk, I asked why Asian women in Silicon Valley do not help Black and Latina women get these cushy engineering jobs that can pay up to $300,000 a year. Hearing my name, the speaker immediately made an educated guess that I am Vietnamese. Then, instead of answering my question, she stated a painful fact: Vietnamese are underrepresented in Silicon Valley, unlike Chinese and Indians.

The question of why Vietnamese workers are underrepresented in high-paying tech workplaces compared to their Chinese and Indian counterparts has bothered me for the past few years. There are many folk theories floating around on social media to explain this phenomenon. Some posit that Indian and Chinese students help each other during internship and job interviews. For example, they create very tight-knit groups to prepare for interviews, akin to college entrance exam preparation. Since Chinese and Indian engineering education is so competitive, cramming for exams has become a regular practice, and they repeat the same exercise in the context of the United States. This explanation is necessary but not sufficient. Something goes deeper than group solidarity. When I decided to leave the life of a social science researcher (I was trained as a sociologist, with a PhD in sociology), I also prepared for similar technical interviews. I got help from many Vietnamese fellows in the United States who were working in tech and finance. However, I was the only one preparing to interview for a data science position. I mainly figured things out on my own, relying on other folks' suggestions of what to do. It was a lonely experience in many ways. I wish I had had a study group that enabled me to prepare for the interviews more effectively.

To better understand the relative scarcity of Vietnamese machine learning and data science candidates, examining the educational policies that China, India, and Vietnam have put in place over the past five decades sheds light on additional reasons. India decided early in the 1960s that it wanted to be in the information technology (IT) business and built an entire educational infrastructure, the Indian Institutes of Technology (IITs), the top engineering universities in India. These schools were built with the help of top private engineering schools in the United States such as Stanford, Cornell, and MIT. The IITs modeled themselves on their American counterparts, placing campuses in remote areas so students could immerse themselves in a well-rounded engineering education, and their international connections created a pipeline sending students abroad to earn PhDs in engineering fields in the United States. China has pursued this slightly differently by creating a tiered system in which tier-1 schools such as Tsinghua, Peking, Fudan, and Shanghai Jiao Tong universities also provide rigorous engineering education. Graduates from those schools pursue PhDs in the United States after college, and their school names are often recognized by admissions committees. Both countries pursued a top-down approach in which engineering education is rigorous and subsidized and produces many outstanding candidates with very good fundamental math training. Recently, the Vietnamese government approved a National Digital Transformation Program, aiming to become one of the top AI players in the ASEAN region by 2025. Engineering education and AI education have received more support to enable this national ambition. As a result, it might take a few years for the labor market, both within Vietnam and globally, to really observe the effects of such a national investment in tech talent.

Moving from a structural explanation to a personal one, I come back to another reason why I felt so lonely and doubtful while applying for jobs. I reflected on why I did not get an undergraduate education in engineering or computer science. I think it was a mistake. As a high school student, I earned a bronze medal in the Vietnamese national Informatics Olympiad. One would think that I'd make a decent engineer. I had solid mathematical reasoning, an ability to follow algorithmic reasoning, and could code and learn a new programming language relatively quickly. The culprits of my not becoming an engineer during my 20s were my parents, or more broadly my extended family network. They somehow convinced me when I was 17 that I would not be happy being one of the only five female students among 100 in the computer science undergraduate major at Hanoi University of Science and Technology. Instead, they thought I would be happier studying at Foreign Trade University. They were half right. I was very bored with the academic life at Foreign Trade University, where lessons didn't feel challenging enough. So I decided to call it quits and found opportunities to study abroad instead. I ended up at a small liberal arts college in the United States, which allowed me to explore everything except engineering. Yet I still wanted to do some math, so I majored in mathematics-economics. Life went on until the last two years of my PhD in sociology, when I had to decide whether to become a professor of sociology or to choose a different career that required upskilling and exploring new territory. I ended up choosing data science because it is more like a research career than any other career in tech, such as software engineering, which I was in any case not equipped to do straight out of graduate school.

Reflecting on the long-winded path that led me to my current profession, I realized that unlike the other women in Silicon Valley whom sociologist Twine interviewed, I had no role model of a successful engineer in my immediate family or friend circles who could show me how to become a successful female engineer in any field. My extended family members all worked in banking; thus, I had no mental image of what it meant to be a female engineer in a male-dominated field like software engineering. Sometimes I ask whether my path would have been smoother if I hadn't bought into my parents' arguments about how difficult my social life would have been had I pursued an undergraduate degree in engineering.

What I have learned from this journey is that a confluence of factors explains why so few Vietnamese candidates' resumes ever end up on my desk. Some of it has to do with the lack of role models who are successful engineers, and with (female) candidates themselves not choosing data science or computer science because it is a male-dominated field and a relatively new one that does not yet dominate the popular imagination as a good career. Another, more significant reason, I would argue, is the structure of Vietnamese higher education, which does not yet prioritize AI development, data science, and practical engineering. This is currently changing, but I hope that the changes come faster, and that more Vietnamese students will choose data science as their career path and apply for data science internships and jobs.

*Nga Than is a senior data scientist, living in New York City.

Go here to see the original:

Where are the Vietnamese data science candidates? - VnExpress International

Leveraging Genomic AI to Deliver a More Accurate and Comprehensive Genome – Genetic Engineering & Biotechnology News

Sponsored content brought to you by

As sequencing costs decrease, the volume of whole genome sequencing (WGS) and whole exome sequencing (WES) continues to rise. Sequencing is just the first step. Providing the best results requires analyzing sequencing data with accelerated compute, data science, and AI to read and understand the genome, from base calls to variant interpretation. The challenge is substantial.

Human genomes are complex. The current understanding, according to the National Human Genome Research Institute, is that compared to a reference human genome, an individual's ~3B-nucleotide genome sequence will have, on average, ~4M SNVs, ~600K insertion/deletion variants, and ~25K structural variants that involve greater than 20M nucleotides.1 As of now, the clinical impact of most of these variants is unknown. Can genomic AI help us to identify the handful of clinically significant genetic variants from this vast ocean of data?

AI methods excel when large amounts of structured data can be paired with validated outcomes for training. Recent population-level sequencing efforts, as well as validation data sets like NIST Genome in a Bottle, have spurred a new category of AI: Genomic AI. Genomic AI has the potential to dramatically reduce the time it takes to analyze, decipher, and interpret sequencing data, but only if the data is carefully assembled across the width of the challenge, from alignment to interpretation.

DNA sequencing has substantial promise to guide healthcare and treatment if the needed tools become more accurate, easier to use, and cost effective. Illumina believes that genomic AI is an emerging tool, complementary to traditional analysis methods and known biology, that can further accuracy advancements and provide a fully featured genome, including annotation and interpretation. To achieve this, the company is using its access to large data sets and world-class AI talent to integrate genomic AI into Illumina's software products.

Three examples will be used to illustrate the utility of this advanced technology: variant calling, annotation and prioritization, and interpretation.

The upstream DRAGEN secondary analysis pipeline improves variant calling accuracy over a larger portion of the human genome, while ensuring that these improvements are generalizable to a wide and diverse population of samples. Hardware-accelerated DRAGEN analysis won the 2020 Precision FDA germline accuracy competition in the Difficult-to-Map regions and All-Benchmark-Regions categories.2

Building on that success, Illumina added powerful and efficient machine learning (ML) algorithms that drive significant performance improvements.

DRAGEN-ML integrates closely with the existing Bayesian Variant Calling pipeline, driving germline accuracy to new heights and addressing challenges in the most difficult genomic regions. Sophisticated and efficient machine learning enables improvement in sensitivity and genotyping accuracy, recovering low-confidence false negative calls and filtering over 50% of false positive calls. "Access to deep internal data and numerous collaborations have allowed us to model how Illumina sequencing reads map to a genomic reference," says Rami Mehio, Head of Software and Informatics, Illumina. "Machine learning has been critical to how our engineers and their algorithms continually improve mapping sensitivity in DRAGEN."

The latest DRAGEN release, DRAGEN v4.2 with enhanced machine learning, trained on a vast amount of data, detects variants with an analytical accuracy of 99.84%, reducing both false positive and false negative rates.* This extends Illumina's lead in providing the most accurate secondary analysis in all benchmark regions compared to other solutions using PrecisionFDA v2 Truth Challenge3 benchmark data.

Delivering a comprehensive platform for genomic analysis, the team continues to invest more in machine learning algorithms for use in RNA analysis, somatic pipelines, methylation analysis and large variant calling for release in future versions of the DRAGEN platform.

Out of the tens of millions of protein-coding variants in the human genome, only 0.1% are presently annotated in clinical variant databases, while the vast majority remain variants of unknown significance (VUS).

To address this challenge, Illumina scientists have developed PrimateAI-3D, a three-dimensional convolutional neural network for variant effect prediction, trained using primate variants and 3-D protein structure. PrimateAI-3D leverages the premise that common variants from non-human primates are unlikely to cause human disease, and has been validated to identify disease-causing variants with superior accuracy across six clinical benchmarks based on real-world patient cohorts.

Published in Science, the PrimateAI-3D project helped drive a massive international collaborative effort to sequence 809 individuals from 233 primate species and create a catalog of common missense variants. Importantly, the species selected for sequencing represent close to half of Earth's 521 extant primate species and cover all major primate families.4 These WGS data were used to train PrimateAI-3D with millions of primate variants.

In a related Science publication, PrimateAI-3D was used to estimate the pathogenicity of rare coding variants in over 450K UK Biobank individuals in order to improve rare-variant association tests and genetic risk prediction for common diseases and complex traits. Stratification of the missense variants using PrimateAI-3D enabled discovery of 73% more significant gene-phenotype associations in rare variant burden tests, outperforming other existing variant interpretation algorithms.5

PrimateAI-3D also enables rare-variant polygenic risk scores (PRS), which are substantially more portable to different cohorts and ancestry groups not used during model training.5 This outcome is extremely relevant as existing PRS algorithms most often train on data from individuals of European descent, which lacks generalization to individuals of other populations.

The PrimateAI-3D deep learning scores and the primate population variant database, which enables classification of 4.3M missense variants as likely benign, are publicly available to the genomics community for research use, in addition to being made available through Illumina software products.

Complementary to PrimateAI-3D's role for protein-coding variants, Illumina scientists earlier released SpliceAI, a deep learning model for identifying pathogenic variants in the non-coding genome. Currently, clinical exome sequencing for rare disease patients is only able to detect a pathogenic variant in around one third of cases by examining the 1% of the genome that is protein-coding. Improving identification of disease-causing variants in the non-coding genome extends clinical sequencing beyond the exome to the whole genome, marking an important step towards helping patients and their families.6

Explainable AI (XAI), created by and integrated into Emedgene tertiary analysis software, prioritizes variants that are most likely to solve a case. Emedgene's XAI lets users follow the reasoning behind its algorithms, while keeping the geneticist in full control. By definition, XAI must be accurate, secure, transparent, and efficient.

Emedgene, for hereditary disease data interpretation applications and assays spanning genomes, exomes, targeted panels, and virtual panels, leverages its XAI and full suite of automation capabilities for users to streamline and minimize touchpoints across their end-to-end germline analysis workflows. This variant interpretation research platform for rare genetic diseases, hereditary cancer and other genetic diseases, and large-scale screening projects significantly reduces time per case.

The use of genomic XAI in Emedgene mimics the work performed by a scientist and provides a full causal explanation of the most relevant variants with accompanying linked and curated evidence. Significant time savings of 50-75% are achieved per case. "Emedgene's Explainable AI (XAI) simplifies the highly complex task of variant prioritization, allowing us to handle more tests every day," relates Ray Louie, PhD, Associate Director, Greenwood Genetic Center.

In addition, a study performed by Baylor Genetics showed that in a 180-sample cohort, Emedgene accurately pinpointed the manually reported variants as candidates to resolve each case. The reported variants were ranked in the top 10 candidate variants in 98.4% of trio cases, in 93.0% of single proband cases, and in 96.7% of all cases. Reduced model accuracy in some cases was due to incomplete variant calling or incomplete phenotypic description.7 The study clearly demonstrated that Emedgene can assist genetic laboratories in prioritizing candidate variants effectively, thereby helping to streamline lab operations.

Decades of internal development and multiple population-level collaborations provide Illumina access to massive amounts of data to train new genomic AI algorithms. The data, in combination with Illumina's world-class products and talent, can help speed genomic AI on its path towards providing a better genome.

References

* Secondary analysis run times on HG002 Illumina sequencing data from PrecisionFDA Truth Challenge V2 with 34.46X coverage. DRAGEN was run on a DRAGEN v4 server with a U200 FPGA card and Machine Learning enabled. BWA GATK 4.1.4.0 was run on a local 2x Intel Xeon Gold 6126 (48 threads) with 394 GB RAM and 2TB NVME SSD using BCBIO for parallelization.

For Research Use Only. Not for use in diagnostic procedures.

Learn more at illumina.com

More here:

Leveraging Genomic AI to Deliver a More Accurate and Comprehensive Genome - Genetic Engineering & Biotechnology News

Beyond Predictions: Uplift Modeling & the Science of Influence (Part I) – Towards Data Science

Illustration by the author

Hands-On Approach to Uplift with Tree-Based Models

Predictive analytics has long been a cornerstone of decision-making, but what if we told you there's an alternative beyond forecasting? What if you could strategically influence the outcomes instead?

Uplift modeling holds this promise. It adds an interesting dynamic layer to traditional predictions by identifying individuals whose behavior can be influenced positively if they receive special treatments.

The application use cases are endless. In medicine, it would help identify patients for whom a medical treatment could improve their health. In retail, such a model allows for better targeting of customers for whom a promotion or personalized offering would be effective in retention.

This article is the first part of a series that explores the transformative potential of uplift modeling, shedding light on how it can reshape strategies in marketing, healthcare, and beyond. It focuses on uplift models based on decision trees and uses, as a case study, the prediction of customer conversion with the application of promotional offers.

After reading this article, you will understand:

No prior knowledge is required to understand the article. The experiments described in the article were carried out using the libraries scikit-uplift, causalml and plotly. You can find the code here on GitHub.
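As a flavor of what such code looks like, here is a minimal, self-contained sketch of the two-model uplift approach with scikit-uplift; the data are synthetic and this is not the article's own code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklift.metrics import uplift_at_k
from sklift.models import TwoModels

# Synthetic stand-in data: features, a treatment flag (offer sent or not),
# and the observed conversion outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))
treatment = rng.integers(0, 2, size=5000)
y = (rng.random(5000) < 0.1 + 0.2 * treatment * (X[:, 0] > 0)).astype(int)

# Two-model approach: one classifier on the treated group, one on the control group;
# predicted uplift is the difference between their conversion probabilities
model = TwoModels(
    estimator_trmnt=RandomForestClassifier(random_state=0),
    estimator_ctrl=RandomForestClassifier(random_state=0),
    method="vanilla",
)
model.fit(X, y, treatment)
uplift = model.predict(X)

# Average uplift among the 30% of customers with the highest predicted uplift
print(uplift_at_k(y, uplift, treatment, strategy="overall", k=0.3))
```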

The best way to understand the benefit of using uplift models is through an example. Imagine a scenario where a telecommunications company aims to reduce customer churn.

A traditional ML-based approach would consist of using a model trained on historical data to predict the likelihood that current customers will churn. This would help identify customers at risk

Read the rest here:

Beyond Predictions: Uplift Modeling & the Science of Influence (Part I) - Towards Data Science