Category Archives: Data Science
Tips for Getting the Generation Part Right in Retrieval Augmented Generation – Towards Data Science
Image created by author using Dall-E 3
Results from experiments to evaluate and compare GPT-4, Claude 2.1, and Claude 3.0 Opus
My thanks to Evan Jolley for his contributions to this piece
New evaluations of RAG systems are published seemingly every day, and many of them focus on the retrieval stage of the framework. However, the generation aspect, how a model synthesizes and articulates this retrieved information, may hold equal if not greater significance in practice. Many use cases in production are not simply returning a fact from the context, but also require synthesizing the fact into a more complicated response.
We ran several experiments to evaluate and compare the generation capabilities of GPT-4, Claude 2.1, and Claude 3 Opus. This article details our research methodology, results, and model nuances encountered along the way, as well as why this matters to people building with generative AI.
Everything needed to reproduce the results can be found in this GitHub repository.
While retrieval is responsible for identifying and retrieving the most pertinent information, it is the generation phase that takes this raw data and transforms it into a coherent, meaningful, and contextually appropriate response. The generative step is tasked with synthesizing the retrieved information, filling in gaps, and presenting it in a manner that is easily understandable and relevant to the user's query.
In many real-world applications, the value of RAG systems lies not just in their ability to locate a specific fact or piece of information but also in their capacity to integrate and contextualize that information within a broader framework. The generation phase is what enables RAG systems to move beyond simple fact retrieval and deliver truly intelligent and adaptive responses.
The initial test we ran involved generating a date string from two randomly retrieved numbers: one representing the month and the other the day. The models were tasked with locating both numbers in the context and combining them into a date.
For example, the random numbers 4827143 and 17 would represent April 17th.
These numbers were placed at varying depths within contexts of varying length. The models initially had quite a difficult time with this task.
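As a rough sketch of how such a needle-in-a-haystack context can be built (the helper below and its filler text are illustrative assumptions, not code from the repository):

```python
def build_context(filler: list[str], needles: dict[str, float]) -> str:
    """Place each needle sentence at a relative depth (0.0 = start, 1.0 = end) of the context."""
    sentences = list(filler)
    for needle, depth in sorted(needles.items(), key=lambda kv: kv[1]):
        sentences.insert(int(len(sentences) * depth), needle)
    return " ".join(sentences)

# Pad with filler to reach the desired context length, then hide the two numbers.
filler = ["The quick brown fox jumps over the lazy dog."] * 500
context = build_context(
    filler,
    {"The month number is 4827143.": 0.25, "The day number is 17.": 0.75},
)
```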
While neither model performed particularly well, Claude 2.1 significantly outperformed GPT-4 in our initial test, almost quadrupling its success rate. It was here that Claude's verbose nature, providing detailed, explanatory responses, seemed to give it a distinct advantage, resulting in more accurate outcomes compared to GPT-4's initially concise replies.
Prompted by these unexpected results, we introduced a new variable to the experiment. We instructed GPT-4 to "explain yourself then answer the question," a prompt that encouraged a more verbose response akin to Claude's natural output. The impact of this minor adjustment was profound.
GPT-4's performance improved dramatically, achieving flawless results in subsequent tests. Claude's results also improved, albeit to a lesser extent.
This experiment not only highlights the differences in how language models approach generation tasks but also showcases the potential impact of prompt engineering on their performance. The verbosity that appeared to be Claude's advantage turned out to be a replicable strategy for GPT-4, suggesting that the way a model processes and presents its reasoning can significantly influence its accuracy in generation tasks. Overall, adding the seemingly minute "explain yourself" line to our prompt played a role in improving the models' performance across all of our experiments.
We conducted four more tests to assess the prevailing models' ability to synthesize and transform retrieved information into various formats:
Unsurprisingly, each model exhibited strong performance in string concatenation, reaffirming previous understanding that text manipulation is a fundamental strength of language models.
As for the money formatting test, Claude 3 and GPT-4 performed almost flawlessly. Claude 2.1's performance was poorer overall. Accuracy did not vary considerably across token length, but was generally lower when the needle was closer to the beginning of the context window.
Despite stellar results in the generation tests, Claude 3's accuracy declined in a retrieval-only experiment. Theoretically, simply retrieving numbers should be an easier task than manipulating them as well, making this decrease in performance surprising and an area where we're planning further testing. If anything, this counterintuitive dip only further confirms the notion that both retrieval and generation should be tested when developing with RAG.
By testing various generation tasks, we observed that while both models excel in menial tasks like string manipulation, their strengths and weaknesses become apparent in more complex scenarios. LLMs are still not great at math! Another key result was that the introduction of the "explain yourself" prompt notably enhanced GPT-4's performance, underscoring the importance of how models are prompted and how they articulate their reasoning in achieving accurate results.
These findings have broader implications for the evaluation of LLMs. When comparing models like the verbose Claude and the initially less verbose GPT-4, it becomes evident that the evaluation criteria must extend beyond mere correctness. The verbosity of a model's responses introduces a variable that can significantly influence its perceived performance. This nuance may suggest that future model evaluations should consider the average length of responses as a noted factor, providing a better understanding of a model's capabilities and ensuring a fairer comparison.
Link:
Tips for Getting the Generation Part Right in Retrieval Augmented Generation - Towards Data Science
5 Data Analyst Projects to Land a Job in 2024 – KDnuggets
I got my first data analytics internship back in 2020.
Ever since then, I've transitioned into a senior-level full-time role, landed multiple freelance data analytics gigs, and consulted for companies in different parts of the world.
During this time, I have reviewed resumes for data analyst positions and even shortlisted candidates for jobs.
And I noticed one thing that separated the most prominent applicants from everyone else.
Projects.
Even if you have zero experience in the data industry and no technical background, you can stand out from everyone else and get hired solely based on the projects you display on your resume.
In this article, I'm going to show you how to create projects that help you stand out from the competition and land your first data analyst job.
If you're reading this article, you probably already know that it is important to display projects on your resume.
You might even have built a few projects of your own after taking an online course or boot camp.
However, many data analytics projects do more harm to your portfolio than good. These projects can actually lower your chances of getting a job and must be avoided at all costs.
For example, if you've taken the popular Google Data Analytics Certificate on Coursera, you've probably done the capstone project that comes with this certification.
However, over 2 million other people have enrolled in the same course, and have potentially completed the same capstone project.
Chances are, recruiters have seen these projects on the resumes of hundreds of applicants and will not be impressed by them.
A similar logic applies to any other project that has been created many times.
Creating a project using the Titanic, Iris, or Boston Housing dataset on Kaggle can be a valuable learning experience, but should not be displayed on your portfolio.
If you want a competitive edge over other people, you need to stand out.
Here's how.
A project that stands out must be unique.
Pick a project that:
Much of the advice on data analytics projects on the Internet is inaccurate and unhelpful.
You will be told to create generic projects like an analysis of the Titanic dataset, projects that add no real value to your resume.
Unfortunately, the people telling you to do these things aren't even working in the data industry, so you must be discerning when taking this advice.
In this article, I will be showing you examples of real people who have landed jobs in data analytics because of their portfolio projects.
You will learn about the types of projects that actually get people hired in this field so that you can potentially build something similar.
The first project is a dashboard displaying job trends in the data industry.
I found this project in a video created by Luke Barousse, a former lead data analyst who also specializes in content creation.
Here is a screenshot of this dashboard:
The above dashboard is called SkillQuery, and it displays the top technologies and skills that employers are looking for in the data industry.
For instance, we can tell by looking at the dashboard that the top language that employers are looking for in data scientists is Python, followed by SQL and R.
The reason this project is so valuable is because it solves an actual problem.
Every job-seeker wants to know the top skills that employers are looking for in their field so they can prepare accordingly.
SkillQuery helps you do exactly this, in the form of an interactive dashboard that you can play around with.
The creator of this project has displayed crucial data analytics skills such as Python, web scraping, and data visualization.
You can find a link to this project's GitHub repository here.
This project was created to predict whether a person will be approved for a credit card or not.
I found it in the same video created by Luke Barousse, and the creator of this project ended up getting a full-time role as a data analyst.
The credit card approval model was deployed as a Streamlit application:
You simply need to answer the questions displayed on this dashboard, and the app will tell you whether or not you have been approved for a credit card.
Again, this is a creative project that solves a real-world problem with a user-friendly dashboard, which is why it stood out to employers.
The skills displayed in this project include Python, data visualization, and cloud storage.
This project, which I created a few years ago, involves conducting sentiment analysis on content from YouTube and Twitter.
I've always enjoyed watching YouTube videos and was particularly fascinated by channels that created makeup tutorials on the platform.
At that time, a huge scandal surfaced on YouTube involving two of my favorite beauty influencers, James Charles and Tati Westbrook.
I decided to analyze this scandal by scraping data on YouTube and Twitter.
I built a sentiment analysis model to gauge public sentiment of the feud and even created visualizations to understand what people were saying about these influencers.
Although this project had no direct business application, it was interesting since I analyzed a topic I was passionate about.
I also wrote a blog post outlining my findings, which you can find here.
The skills demonstrated in this project include web scraping, API usage, Python, data visualization, and machine learning.
This is another project that was created by me.
In this project, I built a K-Means clustering model with Python using a dataset on Kaggle.
I used variables such as gender, age, and income to create various segments of mall customers:
Since the dataset used for this project is popular, I tried to differentiate my analysis from the rest.
After developing the segmentation model, I went a step further by creating consumer profiles for each segment and devising targeted marketing strategies.
Because of these additional steps I took, my project was tailored to the domain of marketing and customer analytics, increasing my chances of getting hired in the field.
I have also created a tutorial on this project, providing a step-by-step guide for building your own customer segmentation model in Python.
The skills demonstrated in this project include Python, unsupervised machine learning, and data analysis.
The final project on this list is a dashboard displaying insights on Udemy courses:
I found this project in a Medium article written by Zach Quinn, who currently is a senior data engineer at Forbes.
Back when he was just starting out, Zach says that this dashboard landed him a data analyst job offer from a reputable company.
And it's easy to see why.
Zach went beyond simply using SQL and Python to process and analyze data.
He has incorporated data communication best practices into this dashboard, making it engaging and visually appealing.
Just by looking at the dashboard, you can gain key insights about Udemy's courses, its students' interests, and its competitors.
The dashboard also demonstrates metrics that are vital to businesses, such as customer engagement and market trends.
Among all the projects listed in this article, I like this one the most since it goes beyond technical skills and displays the analyst's adeptness in data storytelling and presentation.
Here is a link to Zachs article where he provides the code and steps taken to create this project.
I hope that the projects described in this article have inspired you to create one of your own.
If you don't have any project ideas or face obstacles when developing your own, I recommend utilizing generative AI models for assistance.
ChatGPT, for example, can provide a wealth of project ideas and even generate fake datasets, allowing you to hone your analytical skills.
Engaging with ChatGPT for data analysis will allow you to learn new technologies faster and become more efficient, helping you stand out from the competition.
If you'd like to learn more about using ChatGPT and generative AI for data analysis, you can watch my video tutorial on the topic.
Natassha Selvaraj is a self-taught data scientist with a passion for writing. Natassha writes on everything data science-related, a true master of all data topics. You can connect with her on LinkedIn or check out her YouTube channel.
View post:
A Proof of the Central Limit Theorem | by Sachin Date | Apr, 2024 – Towards Data Science
Let's return to our parade of topics. An infinite series forms the basis for generating functions, which is the topic I will cover next.
The trick to understanding Generating Functions is to appreciate the usefulness of a Label Maker.
Imagine that your job is to label all the shelves of newly constructed libraries, warehouses, storerooms, pretty much anything that requires an extensive application of labels. Anytime they build a new warehouse in Boogersville or revamp a library in Belchertown (I am not entirely making these names up), you get a call to label its shelves.
So imagine then that you just got a call to label out a shiny new warehouse. The aisles in the warehouse go from 1 through 26, and each aisle runs 50 spots deep and 5 shelves tall.
You could just print out 6500 labels like so:
A.1.1, A.1.2, …, A.1.5, A.2.1, …, A.2.5, …, A.50.1, …, A.50.5, B.1.1, B.2.1, …, B.50.5, … and so on until Z.50.5.
And you could present yourself along with your suitcase stuffed with 6500 fluorescent dye-coated labels at your local airport for a flight to Boogersville. It might take you a while to get through airport security.
Or here's an idea. Why not program the sequence into your label maker? Just carry the label maker with you. At Boogersville, load the machine with a roll of tape, and off you go to the warehouse. At the warehouse, you press a button on the machine, and out flows the entire sequence for aisle A.
Your label maker is the generating function for this, and other sequences like this one:
A.1.1, A.1.2, …, A.1.5, A.2.1, …, A.2.5, …, A.50.1, …, A.50.5
In math, a generating function is a mathematical function that you design for generating sequences of your choosing so that you don't have to remember the entire sequence.
If your proof uses a sequence of some kind, its often easier to substitute the sequence with its generating function. That instantly saves you the trouble of lugging around the entire sequence across your proof. Any operations, like differentiation, that you planned to perform on the sequence, you can instead perform them on its generating function.
But wait, there's more. All of the above advantages are magnified whenever the generating function has a closed form, like the formula for e to the power x that we saw earlier.
A really simple generating function is the one shown in the figure below for the following infinite sequence: 1, 1, 1, 1, 1, …:
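The figure is not reproduced in this extract, but the generating function for the all-ones sequence is the standard geometric series:

```latex
G(x) = 1 + x + x^2 + x^3 + \cdots = \sum_{k=0}^{\infty} x^k = \frac{1}{1 - x}, \qquad |x| < 1
```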
As you can see, a generating function is actually a series.
A slightly more complex generating function, and a famous one, is the one that generates a sequence of (n+1) binomial coefficients:
Each coefficient nCk gives you the number of different ways of choosing k out of n objects. The generating function for this sequence is the binomial expansion of (1 + x) to the power n:
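In symbols, this is the familiar binomial expansion, whose coefficients form the sequence in question:

```latex
(1 + x)^n = \sum_{k=0}^{n} \binom{n}{k} x^k = \binom{n}{0} + \binom{n}{1} x + \cdots + \binom{n}{n} x^n
```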
In both examples, it's the coefficients of the x terms that constitute the sequence. The x terms raised to different powers are there primarily to keep the coefficients apart from each other. Without the x terms, the summation will just fuse all the coefficients into a single number.
The two examples of generating functions I showed you illustrate applications of the modestly named Ordinary Generating Function. The OGF has the following general form:
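The general form referred to here is the standard one, with the sequence a_0, a_1, a_2, … riding on powers of x:

```latex
G(x) = \sum_{k=0}^{\infty} a_k x^k = a_0 + a_1 x + a_2 x^2 + \cdots
```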
Another greatly useful form is the Exponential Generating Function (EGF):
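Its standard form differs from the OGF only by the factorial in each denominator:

```latex
E(x) = \sum_{k=0}^{\infty} a_k \frac{x^k}{k!} = a_0 + a_1 x + a_2 \frac{x^2}{2!} + a_3 \frac{x^3}{3!} + \cdots
```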
It's called exponential because the value of the factorial term in the denominator increases at an exponential rate, causing the values of the successive terms to diminish at an exponential rate.
The EGF has a remarkably useful property: its k-th derivative, when evaluated at x=0, isolates the k-th element of the sequence, a_k. See below for how the 3rd derivative of the above-mentioned EGF, when evaluated at x=0, gives you the coefficient a_3. All other terms disappear into nothingness:
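The figure is not reproduced here, but the calculation is the standard one: differentiating the EGF three times and setting x = 0 kills every term except the one carrying a_3:

```latex
\frac{d^3}{dx^3} E(x) \bigg|_{x=0}
  = \sum_{k=3}^{\infty} a_k \frac{x^{k-3}}{(k-3)!} \bigg|_{x=0}
  = a_3
```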
Our next topic, the Taylor series, makes use of the EGF.
The Taylor series is a way to approximate a function using an infinite series. The Taylor series for the function f(x) goes like this:
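The expansion referenced here, and unpacked in the next two paragraphs, is the standard one around the point a:

```latex
f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x - a)^k
     = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots
```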
In evaluating the first two terms, we use the fact that 0! = 1! = 1.
f(a), f′(a), f″(a), etc. are the 0-th, 1st, 2nd, etc. derivatives of f(x) evaluated at x=a; the 0-th derivative is simply f(a) itself. The value a can be anything as long as the function is infinitely differentiable at x = a, that is, its k-th derivative exists at x = a for all k from 1 through infinity.
In spite of its startling originality, the Taylor series doesn't always work well. It creates poor quality approximations for functions such as 1/x or 1/(1−x), which march off to infinity at certain points in their domain, such as at x = 0 and x = 1 respectively. These are functions with singularities in them. The Taylor series also has a hard time keeping up with functions that fluctuate rapidly. And then there are functions whose Taylor series based expansions will converge at a pace that will make continental drifts seem recklessly fast.
But let's not be too withering about the Taylor series' imperfections. What is really astonishing about it is that such an approximation works at all!
The Taylor series happens to be one of the most studied, and most used, mathematical artifacts.
On some occasions, the upcoming proof of the CLT being one such occasion, you'll find it useful to split the Taylor series in two parts as follows:
Here, I've split the series around the index r. Let's call the two pieces T_r(x) and R_r(x). We can express f(x) in terms of the two pieces as follows:
T_r(x) is known as the Taylor polynomial of order r evaluated at x=a.
R_r(x) is the remainder or residual from approximating f(x) using the Taylor polynomial of order r evaluated at x=a.
By the way, did you notice a glint of similarity between the structure of the above equation and the general form of a linear regression model consisting of the observed value y, the modeled value βX, and the residual e?
But let's not dim our focus.
Returning to the topic at hand, Taylor's theorem, which we'll use to prove the Central Limit Theorem, is what gives the Taylor series its legitimacy. Taylor's theorem says that as x → a, the remainder term R_r(x) converges to 0 faster than the polynomial (x − a) raised to the power r. Shaped into an equation, the statement of Taylor's theorem looks like this:
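In the Peano form described in the sentence above, the statement is that the remainder vanishes faster than (x − a)^r:

```latex
\lim_{x \to a} \frac{R_r(x)}{(x - a)^r} = 0
```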
One of the great many uses of the Taylor series lies in creating a generating function for the moments of a random variable, which is what we'll do next.
The k-th moment of a random variable X is the expected value of X raised to the k-th power.
This is known as the k-th raw moment.
The k-th moment of X around some value c is known as the k-th central moment of X. It's simply the k-th raw moment of (X − c):
The k-th standardized moment of X is the k-th central moment of X divided by k-th power of the standard deviation of X:
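For reference, the three definitions just described are the standard ones (with c taken to be the mean μ in the standardized case):

```latex
\text{raw: } E[X^k], \qquad
\text{around } c: \; E[(X - c)^k], \qquad
\text{standardized: } \frac{E[(X - \mu)^k]}{\sigma^k}
```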
The first 5 moments of X have specific values or meanings attached to them as follows:
After the 4th moment, the interpretations become assuredly murky.
With so many moments flying around, wouldn't it be terrific to have a generating function for them? That's what the Moment Generating Function (MGF) is for. The Taylor series makes it super easy to create the MGF. Let's see how to create it.
We'll define a new random variable tX, where t is a real number. Here's the Taylor series expansion of e to the power tX, evaluated at t = 0:
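Since e^{tX} is its own infinite series, the expansion is simply:

```latex
e^{tX} = \sum_{k=0}^{\infty} \frac{(tX)^k}{k!}
       = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots
```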
Let's apply the Expectation operator on both sides of the above equation:
By the linearity (and scaling) rule of expectation, E(aX + bY) = aE(X) + bE(Y), we can move the Expectation operator inside the summation as follows:
Recall that E[X^k] are the raw moments of X for k = 0, 1, 2, 3, …
Let's compare Eq. (2) with the general form of an Exponential Generating Function:
What do we observe? We see that the E[X^k] in Eq. (2) are the coefficients a_k in the EGF. Thus Eq. (2) is the generating function for the moments of X, and so the formula for the Moment Generating Function of X is the following:
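Putting the pieces together, the MGF is the expectation of e^{tX}, i.e. the EGF whose coefficients are the raw moments of X:

```latex
M_X(t) = E\!\left[e^{tX}\right] = \sum_{k=0}^{\infty} E[X^k] \, \frac{t^k}{k!}
```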
The MGF has many interesting properties. We'll use a few of them in our proof of the Central Limit Theorem.
Remember how the k-th derivative of the EGF, when evaluated at x = 0, gives us the k-th coefficient of the underlying sequence? We'll use this property of the EGF to pull out the moments of X from its MGF.
The zeroth derivative of the MGF of X evaluated at t = 0 is obtained by simply substituting t = 0 in Eq. (3). M_X(t=0) evaluates to 1. The first, second, third, etc. derivatives of the MGF of X evaluated at t = 0 are denoted by M′_X(t=0), M″_X(t=0), M‴_X(t=0), etc. They evaluate respectively to the first, second, third, etc. raw moments of X, as shown below:
This gives us our first interesting and useful property of the MGF. The k-th derivative of the MGF evaluated at t = 0 is the k-th raw moment of X.
The second property of MGFs, which we'll find useful in our upcoming proof, is the following: if two random variables X and Y have identical Moment Generating Functions, then X and Y have identical Cumulative Distribution Functions:
If X and Y have identical MGFs, it implies that their mean, variance, skewness, kurtosis, and all higher-order moments (whatever humanly unfathomable aspects of reality those moments might represent) are all identical. If every single property exhibited by the shapes of X's and Y's CDFs is correspondingly the same, you'd expect their CDFs to also be identical.
The third property of MGFs we'll use is the following one, which applies to X when X is scaled by a and translated by b:
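This is the standard scaling-and-translation identity:

```latex
M_{aX + b}(t) = e^{bt} \, M_X(at)
```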
The fourth property of MGFs that we'll use applies to the MGF of the sum of n independent, identically distributed random variables:
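For i.i.d. X_1, …, X_n with common MGF M_X, independence lets the expectation factor, giving:

```latex
M_{X_1 + X_2 + \cdots + X_n}(t) = \prod_{i=1}^{n} M_{X_i}(t) = \left[ M_X(t) \right]^n
```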
A final result, before we prove the CLT, is the MGF of a standard normal random variable N(0, 1) which is the following (you may want to compute this as an exercise):
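Working the exercise out (complete the square inside the Gaussian integral) yields:

```latex
M_Z(t) = E\!\left[e^{tZ}\right]
       = \int_{-\infty}^{\infty} e^{tz} \, \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz
       = e^{t^2/2}, \qquad Z \sim N(0, 1)
```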
Speaking of the standard normal random variable, as shown in Eq. (4), the first, second, third, and fourth derivatives of the MGF of N(0, 1), when evaluated at t = 0, will give you the first moment (mean) as 0, the second moment (variance) as 1, the third moment (skew) as 0, and the fourth moment (kurtosis) as 3.
And with that, the machinery we need to prove the CLT is in place.
Let X_1, X_2, …, X_n be n i.i.d. random variables that form a random sample of size n. Assume that we've drawn this sample from a population that has mean μ and variance σ².
Let X_bar_n be the sample mean:
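The definition, not reproduced in this extract, is the usual average:

```latex
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i
```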
Let Z_bar_n be the standardized sample mean:
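Standardizing the sample mean by its own mean and standard deviation gives:

```latex
\bar{Z}_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}
```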
The Central Limit Theorem states that as n tends to infinity, Z_bar_n converges in distribution to N(0, 1), i.e., the CDF of Z_bar_n becomes identical to the CDF of N(0, 1), which is often represented by the Greek letter Φ (phi):
To prove this statement, we'll use the property of the MGF (see Eq. 5) that if the MGFs of X and Y are identical, then so are their CDFs. Here, it'll be sufficient to show that as n tends to infinity, the MGF of Z_bar_n converges to the MGF of N(0, 1), which as we know (see Eq. 8) is e to the power t²/2. In short, we'd want to prove the following identity:
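Written out, the identity we are after is:

```latex
\lim_{n \to \infty} M_{\bar{Z}_n}(t) = e^{t^2/2}
```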
Let's define a random variable Z_k as follows:
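The standard construction, consistent with Z later being treated as a standard normal variable, is:

```latex
Z_k = \frac{X_k - \mu}{\sigma}, \qquad k = 1, 2, \ldots, n
```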
We'll now express the standardized mean Z_bar_n in terms of Z_k as shown below:
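With that definition, the standardized mean becomes a scaled sum of the Z_k:

```latex
\bar{Z}_n = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}
          = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} Z_k
```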
Next, we apply the MGF operator on both sides of Eq. (9):
By construction, Z_1/√n, Z_2/√n, …, Z_n/√n are independent random variables. So we can use property (7a) of MGFs, which expresses the MGF of the sum of n independent random variables:
By their definition, Z_1/√n, Z_2/√n, …, Z_n/√n are also identical random variables. So we award ourselves the liberty to assume the following:
Z_1/√n = Z_2/√n = … = Z_n/√n = Z/√n.
Therefore using property (7b) we get:
Finally, we'll also use property (6) to express the MGF of a random variable (in this case, Z) that is scaled by a constant (in this case, 1/√n) as follows:
With that, we have converted our original goal of finding the MGF of Z_bar_n into the goal of finding the MGF of Z/√n.
M_Z(t/√n) is a function like any other function that takes (t/√n) as a parameter. So we can create a Taylor series expansion of M_Z(t/√n) at t = 0 as follows:
Next, we split this expansion into two parts. The first part is a finite series of three terms corresponding to k = 0, k = 1, and k = 2. The second part is the remainder of the infinite series:
In the above series, M, M′, M″, etc. are the 0-th, 1st, 2nd, and so on derivatives of the Moment Generating Function M_Z(t/√n) evaluated at (t/√n) = 0. We've seen that these derivatives of the MGF happen to be the 0-th, 1st, 2nd, etc. moments of Z.
The 0-th moment, M(0), is always 1. Recall that Z is, by its construction, a standard normal random variable. Hence, its first moment (mean), M′(0), is 0, and its second moment (variance), M″(0), is 1. With these values in hand, we can express the above Taylor series expansion as follows:
Another way to express the above expansion of M_Z is as the sum of a Taylor polynomial of order 2 which captures the first three terms of the expansion, and a residue term that captures the summation:
We've already evaluated the order-2 Taylor polynomial. So our task of finding the MGF of Z is now further reduced to calculating the remainder term R_2.
Before we tackle the task of computing R_2, let's step back and review what we want to prove. We wish to prove that as the sample size n tends to infinity, the standardized sample mean Z_bar_n converges in distribution to the standard normal random variable N(0, 1):
To prove this we realized that it was sufficient to prove that the MGF of Z_bar_n will converge to the MGF of N(0, 1) as n tends to infinity.
And that led us on a quest to find the MGF of Z_bar_n shown first in Eq. (10), and which I am reproducing below for reference:
But it is really the limit of this MGF as n tends to infinity that we not only wish to calculate, but also show to be equal to e to the power t²/2.
To make it to that goal, we'll unpack and simplify the contents of Eq. (10) by sequentially applying result (12) followed by result (11) as follows:
Here we come to an uncomfortable place in our proof. Look at the equation on the last line in the above panel. You cannot just force the limit on the R.H.S. into the large bracket and zero out the yellow term. The trouble with making such a misinformed move is that there is an n looming large in the exponent of the large bracket, the very n that wants to march away to infinity. But now get this: I said you cannot force the limit into the large bracket. I never said you cannot sneak it in.
So we shall make a sly move. We'll show that the remainder term R_2, colored in yellow, independently converges to zero as n tends to infinity, no matter what its exponent is. If we succeed in that endeavor, common-sense reasoning suggests that it will be legal to extinguish it from the R.H.S., exponent or no exponent.
To show this, we'll use Taylor's theorem, which I introduced in Eq. (1), and which I am reproducing below for your reference:
We'll bring this theorem to bear upon our pursuit by setting x to (t/√n), and r to 2, as follows:
Next, we set a = 0, which instantly allows us to switch the limit (t/√n) → 0 to n → ∞, as follows:
Now we make an important and not entirely obvious observation. In the above limit, notice how the L.H.S. will tend to zero as long as n tends to infinity, independent of what value t has, as long as it's finite. In other words, the L.H.S. will tend to zero for any finite value of t, since the limiting behavior is driven entirely by the √n in the denominator. With this revelation comes the luxury to drop t from the denominator without changing the limiting behavior of the L.H.S. And while we're at it, let's also swing the √n over to the numerator as follows:
Let this result hang in your mind for a few seconds, for you'll need it shortly. Meanwhile, let's return to the limit of the MGF of Z_bar_n as n tends to infinity. We'll make some more progress on simplifying the R.H.S. of this limit, and then sculpt it into a certain shape:
It may not look like it, but with Eq. (14), we are literally two steps away from proving the Central Limit Theorem.
All thanks to Jacob Bernoulli's blast-from-the-past discovery of the product-series-based formula for e.
So this will be the point to fetch a few balloons, confetti, party horns or whatever.
Ready?
Here we go:
View post:
A Proof of the Central Limit Theorem | by Sachin Date | Apr, 2024 - Towards Data Science
Pandas: From Messy To Beautiful. This is how to make your pandas code | by Anna Zawadzka | Mar, 2024 – Towards Data Science
Scripting around a pandas DataFrame can turn into an awkward pile of (not-so-)good old spaghetti code. My colleagues and I use this package a lot, and while we try to stick to good programming practices, like splitting code into modules and unit testing, sometimes we still get in the way of one another by producing confusing code.
I have gathered some tips and pitfalls to avoid in order to make pandas code clean and infallible. Hopefully you'll find them useful too. We'll get some help from Robert C. Martin's classic Clean Code, specifically for the context of the pandas package. TL;DR at the end.
Let's begin by observing some faulty patterns inspired by real life. Later on, we'll try to rephrase that code in order to favor readability and control.
Pandas DataFrames are value-mutable [2, 3] objects. Whenever you alter a mutable object, it affects the exact same instance that you originally created, and its physical location in memory remains unchanged. In contrast, when you modify an immutable object (e.g. a string), Python creates a whole new object at a new memory location and swaps the reference for the new one.
This is the crucial point: in Python, objects are passed to functions by assignment [4, 5]. See the graph: the value of df was assigned to the variable in_df when it was passed to the function as an argument. Both the original df and the in_df inside the function point to the same memory location (the numeric value in parentheses), even if they go by different variable names. During the modification of its attributes, the location of the mutable object remains unchanged. Now all other scopes can see the changes too, since they reach the same memory location.
Actually, since we have modified the original instance, it's redundant to return the DataFrame and assign it to the variable. This code has the exact same effect:
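The original listing isn't reproduced in this extract; a minimal sketch of such an in-place version might look like this (the column name and the operation are illustrative assumptions):

```python
import pandas as pd

def modify_df(in_df: pd.DataFrame) -> None:
    # The caller's DataFrame is mutated directly, so there is nothing useful to return.
    in_df["name_len"] = in_df["name"].str.len()

df = pd.DataFrame({"name": ["Bob", "Alice"]})
modify_df(df)
print(df)  # df already carries the new "name_len" column
```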
Heads-up: the function now returns None, so be careful not to overwrite the df with None if you do perform the assignment: df = modify_df(df).
In contrast, if the object is immutable, it will change the memory location throughout the modification just like in the example below. Since the red string cannot be modified (strings are immutable), the green string is created on top of the old one, but as a brand new object, claiming a new location in memory. The returned string is not the same string, whereas the returned DataFrame was the exact same DataFrame.
The point is, mutating DataFrames inside functions has a global effect. If you don't keep that in mind, you may:
We'll fix that problem later, but here is another don't before we pass to the do's.
The design from the previous section is actually an anti-pattern called output argument [1 p.45]. Typically, inputs of a function will be used to create an output value. If the sole point of passing an argument to a function is to modify it, so that the input argument changes its state, then it's challenging our intuitions. Such behavior is called a side effect [1 p.44] of a function, and side effects should be well documented and minimized because they force the programmer to remember the things that go on in the background, therefore making the script error-prone.
When we read a function, we are used to the idea of information going in to the function through arguments and out through the return value. We don't usually expect information to be going out through the arguments. [1 p.41]
Things get even worse if the function has a double responsibility: to modify the input and to return an output. Consider this function:
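The function itself isn't shown in this extract; a sketch of the anti-pattern it describes (with hypothetical names) would be:

```python
import pandas as pd

def max_name_length(in_df: pd.DataFrame) -> int:
    # Hidden side effect: a new column silently appears in the caller's DataFrame...
    in_df["name_len"] = in_df["name"].str.len()
    # ...while the signature only advertises the returned value.
    return int(in_df["name_len"].max())
```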
It does return a value as you would expect, but it also permanently modifies the original DataFrame. The side effect takes you by surprise: nothing in the function signature indicated that our input data was going to be affected. In the next step, we'll see how to avoid this kind of design.
To eliminate the side effect, in the code below we have created a new temporary variable instead of modifying the original DataFrame. The notation lengths: pd.Series indicates the datatype of the variable.
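A side-effect-free sketch matching that description (again with assumed column names) could look like:

```python
import pandas as pd

def max_name_length(in_df: pd.DataFrame) -> int:
    # Intermediate state is kept in a local variable; the input DataFrame is left untouched.
    lengths: pd.Series = in_df["name"].str.len()
    return int(lengths.max())
```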
This function design is better in that it encapsulates the intermediate state instead of producing a side effect.
Another heads-up: please be mindful of the differences between deep and shallow copy [6] of elements from the DataFrame. In the example above we have modified each element of the original df["name"] Series, so the old DataFrame and the new variable have no shared elements. However, if you directly assign one of the original columns to a new variable, the underlying elements still have the same references in memory. See the examples:
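The examples aren't included in this extract; roughly, the distinction is the one below (the exact sharing behavior also depends on your pandas version and its copy-on-write setting):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Bob", "Alice"]})

names_view = df["name"]        # plain column access: may still share underlying data with df
shallow = df.copy(deep=False)  # new DataFrame object, but the column data is still shared
deep = df.copy(deep=True)      # fully independent copy in freshly allocated memory
```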
You can print out the DataFrame after each step to observe the effect. Remember that creating a deep copy will allocate new memory, so it's good to reflect on whether your script needs to be memory-efficient.
Maybe for whatever reason you want to store the result of that length computation. It's still not a good idea to append it to the DataFrame inside the function, because of the side effect breach as well as the accumulation of multiple responsibilities inside a single function.
I like the One Level of Abstraction per Function rule that says:
We need to make sure that the statements within our function are all at the same level of abstraction.
Mixing levels of abstraction within a function is always confusing. Readers may not be able to tell whether a particular expression is an essential concept or a detail. [1 p.36]
Also, let's employ the Single Responsibility Principle [1 p.138] from OOP, even though we're not focusing on object-oriented code right now.
Why not prepare your data beforehand? Let's split data preparation and the actual computation into separate functions:
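A sketch of that split, consistent with the name_len column and the generic aggregation mentioned next (the helper name create_name_len_col is an assumption; find_max_element is named later in the article):

```python
import pandas as pd
from typing import Collection

def create_name_len_col(df: pd.DataFrame) -> pd.DataFrame:
    # One task only: return a copy of the input with a derived name_len column.
    out = df.copy()
    out["name_len"] = out["name"].str.len()
    return out

def find_max_element(collection: Collection) -> int:
    # Generic aggregation: works for any Collection, not just pandas objects.
    return max(collection) if len(collection) else 0

df = create_name_len_col(pd.DataFrame({"name": ["Bob", "Alice"]}))
longest = find_max_element(df["name_len"])
```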
The individual task of creating the name_len column has been outsourced to another function. It does not modify the original DataFrame and it performs one task at a time. Later we retrieve the max element by passing the new column to another dedicated function. Notice how the aggregating function is generic for Collections.
Let's brush the code up with the following steps:
The way we have split the code really makes it easy to go back to the script later, take the entire function and reuse it in another script. We like that!
There is one more thing we can do to increase the level of reusability: pass column names as parameters to functions. The refactoring is going a little bit over the top, but sometimes it pays off for the sake of flexibility or reusability.
Did you ever figure out that your preprocessing was faulty after weeks of experiments on the preprocessed dataset? No? Lucky you. I actually had to repeat a batch of experiments because of broken annotations, which could have been avoided if I had tested just a couple of basic functions.
Important scripts should be tested [1 p.121, 7]. Even if the script is just a helper, I now try to test at least the crucial, most low-level functions. Let's revisit the steps that we made from the start:
1. I am not happy to even think of testing this; it's very redundant and we have paved over the side effect. It also tests a bunch of different features: the computation of name length and the aggregation of the result for the max element. Plus, it fails. Did you see that coming?
2. This is much better: we have focused on one single task, so the test is simpler. We also don't have to fixate on column names like we did before. However, I think that the format of the data gets in the way of verifying the correctness of the computation.
3. Here we have cleaned up the desk. We test the computation function inside out, leaving the pandas overlay behind. It's easier to come up with edge cases when you focus on one thing at a time. I figured out that I'd like to test for None values that may appear in the DataFrame, and I eventually had to improve my function for that test to pass. A bug caught!
4. We're only missing the test for find_max_element:
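A pytest-style sketch of that missing test, assuming the find_max_element from the earlier sketch:

```python
def test_find_max_element():
    assert find_max_element([3, 7, 1]) == 7
    assert find_max_element([]) == 0  # assumed behavior for an empty collection
```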
One additional benefit of unit testing that I never forget to mention is that it is a way of documenting your code, as someone who doesn't know it (like you from the future) can easily figure out the inputs and expected outputs, including edge cases, just by looking at the tests. Double gain!
These are some tricks I found useful while coding and reviewing other people's code. I'm far from telling you that one or another way of coding is the only correct one; take what you want from it, and decide whether you need a quick scratch or a highly polished and tested codebase. I hope this thought piece helps you structure your scripts so that you're happier with them and more confident about their infallibility.
If you liked this article, I would love to know about it. Happy coding!
TL;DR
There's no one and only correct way of coding, but here are some inspirations for scripting with pandas:
Don'ts:
- don't mutate your DataFrame too much inside functions, because you may lose control over what gets appended to or removed from it,
- don't write methods that mutate a DataFrame and return nothing, because that's confusing.
Dos:
- create new objects instead of modifying the source DataFrame and remember to make a deep copy when needed,
- perform only similar-level operations inside a single function,
- design functions for flexibility and reusability,
- test your functions, because this helps you design cleaner code, secure it against bugs and edge cases, and document it for free.
The graphs were created by me using Miro. The cover image was also created by me using the Titanic dataset and GIMP (smudge effect).
Read more from the original source:
How the New Breed of LLMs is Replacing OpenAI and the Likes – DataScienceCentral.com – Data Science Central
Of course, OpenAI, Mistral, Claude and the likes may adapt. But will they manage to stay competitive in this evolving market? Last week Databricks launched DBRX. It clearly shows the new trend: specialization, lightweight, combining multiple LLMs, enterprise-oriented, and better results at a fraction of the cost. Monolithic solutions where you pay by the token encourage the proliferation of models with billions or trillions of tokens, weights and parameters. They are embraced by companies such as Nvidia, because they use a lot of GPU and make chip producers wealthy. One of the drawbacks is the cost incurred by the customer, with no guarantee of positive ROI. The quality may also suffer (hallucinations).
In this article, I discuss the new type of architecture under development. Hallucination-free, they achieve better results at a fraction of the cost and run much faster. Sometimes without GPU, sometimes without training. Targeting professional users rather than the layman, they rely on self-tuning and customization. Indeed, there is no universal evaluation metric: laymen and experts have very different ratings and expectations when using these tools.
Much of this discussion is based on the technology that I develop for a Fortune 100 company. I show the benefits, but also potential issues. Many of my competitors are moving in the same direction.
Before diving into the architecture of new LLMs, let's first discuss the current funding model. Many startups get funding from large companies such as Microsoft, Nvidia or Amazon. It means that they have to use their cloud solutions, services and products. The result is high costs for the customer. Startups that rely on vendor-neutral VC funding face a similar challenge: you cannot raise VC money by saying that you could do better and charge 1000x less. VC firms expect to make billions of dollars, not mere millions. To maintain this ecosystem, players spend a lot of money on advertising and hype. In the end, if early investors can quickly make big money through acquisitions, it is a win. What happens when clients realize ROI is negative is unimportant. As long as it does not happen too soon! But can investors even achieve this short-term goal?
The problem is compounded by the fact that researchers believe deep neural networks (DNN) are the panacea, with issues simply fixed by using bigger data, multiple transforms to make DNN work, or front-end patches such as prompt engineering, to address foundational back-end problems. Sadly, no one works on ground-breaking innovations outside DNNs. I am an exception.
In the end, very few self-funded entrepreneurs can compete, offering a far less expensive alternative with no plan on becoming a billionaire. I may be the only one able to survive and thrive, long-term. My intellectual property is open-source, patent-free, and comes with extensive documentation, source code, and comparisons. It appeals to large, traditional corporations. The word is out; it is no longer a secret. In turn, it puts pressure on big players to offer better LLMs. They can see how I do it and implement the same algorithms on their end. Or come up with their own solutions independently. Either way, the new type of architecture is pretty much the same in all cases, not much different from mine. The new Databricks LLM (DBRX) epitomizes this trend. Mine is called xLLM.
Surprisingly, none of the startups working on new LLMs consider monetizing their products via advertising: blending organic output with sponsored results relevant to the user prompt. I am contemplating doing it, with a large client interested in signing up when the option is available.
As concisely stated by one of my clients, the main issues to address are:
In addition to blending specialized LLMs (one per top category, each with its own set of embeddings and other summary tables), a new trend is emerging. It consists of blending multiple LLMs focusing on the same topic, each one with its own flavor: technical, general, or based on different parameters. Then, these models are combined just like XGBoost combines multiple small decision trees to get the best from all of them. In short, an ensemble method.
Note that speed and accuracy result from using many small, specialized tables (embeddings and so on) as opposed to a big table with long, fixed-size embedding vectors and expensive semantic / vector search. The user selects the categories that best match his prompt. In my case, there is no neural network involved, no GPU needed, yet no latency and no hallucinations. Liability is further reduced with a local implementation, and explainable AI.
Carefully selecting input sources (in many cases, corporate repositories augmented with external data) and smart crawling to reconstruct the hidden structure (underlying taxonomy, breadcrumbs, navigation links, headings, and so on), are critical components of this architecture.
For details about xLLM (technical implementation, comparing output with OpenAI and the likes on the same prompts, Python code, input sources, and documentation), see here. I also offer a free course on the topic, here.
Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author (Elsevier) and patent owner (one related to LLMs). Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.
Read the rest here:
Data science classes, bootcamps and certificates in NYC – Time Out
Data science is booming, thanks to the exponential increase in available data, advancements in technology and computing power, and the high demand for data-driven insights to inform decisions across all sectors. Data science classes and bootcamps in NYC offer the perfect opportunity to master essential data science skills like Python programming, machine learning, and data analysis. You'll learn how to extract insights from complex datasets, build predictive models, and create data visualizations. NYC is a hub of innovation and technology, where you'll have unparalleled access to industry experts, networking opportunities, and real-world projects. Whether you're a seasoned professional looking to upskill or a curious beginner eager to explore the possibilities of data science, NYC offers the ideal environment to thrive in this rapidly evolving field.
Recommended: Best certificate programs in NYC
Recommended: Best coding bootcamps in NYC
Recommended: Best coding classes & bootcamps near you
Recommended: Best data science classes and programs for high school students
Recommended: Best digital marketing classes and certificates in NYC
Continued here:
Data science classes, bootcamps and certificates in NYC - Time Out
Do not over-think about ‘outliers’, use a student-t distribution instead – Towards Data Science
A Student's t-distribution is nothing more than a Gaussian distribution with heavier tails. In other words, we can say that the Gaussian distribution is a special case of the Student's t-distribution. The Gaussian distribution is defined by the mean (μ) and the standard deviation (σ). The Student's t-distribution, on the other hand, adds an additional parameter, the degrees of freedom (df), which controls the thickness of the distribution. This parameter assigns greater probability to events further from the mean. This feature is particularly useful for small sample sizes, such as in biomedicine, where the assumption of normality is questionable. Note that as the degrees of freedom increase, the Student's t-distribution approaches the Gaussian distribution. We can visualize this using density plots:
Note in Figure 1 that the hill around the mean gets smaller as the degrees of freedom decrease, as a result of the probability mass going to the tails, which are thicker. This property is what gives the Student's t-distribution a reduced sensitivity to outliers. For more details on this matter, you can check this blog.
We load the required libraries:
So, let's skip data simulations and get serious. We'll work with real data I have acquired from mice performing the rotarod test.
First, we load the dataset into our environment and set the corresponding factor levels. The dataset contains IDs for the animals, a grouping variable (Genotype), an indicator for two different days on which the test was performed (day), and different trials for the same day. For this article, we model only one of the trials (Trial3). We will save the other trials for a future article on modeling variation.
As the data handling implies, our modeling strategy will be based on Genotype and Day as categorical predictors of the distribution of Trial3.
In biomedical science, categorical predictors, or grouping factors, are more common than continuous predictors. Scientists in this field like to divide their samples into groups or conditions and apply different treatments.
Let's have an initial view of the data using Raincloud plots, as shown by Guilherme A. Franchi, PhD, in this great blog post.
Figure 2 looks different from the original by Guilherme A. Franchi, PhD because we are plotting two factors instead of one. However, the nature of the plot is the same. Pay attention to the red dots; these are the ones that can be considered extreme observations that tilt the measures of central tendency (especially the mean) toward one direction. We also observe that the variances are different, so modeling sigma as well can give better estimates. Our task now is to model the output using the brms package.
See the rest here:
Do not over-think about 'outliers', use a student-t distribution instead - Towards Data Science
The Many Pillars of Getting the Most Value From Your Organization’s Data – Towards Data Science
Photo by Choong Deng Xiang on Unsplash
Let me introduce you to Sarah, a talented and passionate data scientist, who just landed her dream job at GreenEnv, a large company that makes eco-friendly cleaning products. GreenEnv has tons of data on customers, products, and other areas of the business. They hired Sarah to unlock the hidden potential within this data, uncovering market trends, competitive advantages, and more.
Her first task: analyze customer demographics and buying habits to create targeted marketing campaigns. Confident in her abilities and excited to apply data science methods, Sarah dived into the customer database. But her initial excitement quickly faded. The data was a mess: inconsistent formatting, misspelled names, and duplicate entries everywhere. Data quality was terrible. There were variations of names like Jhon Smith and Micheal Brown alongside entries like Jhonn Smtih and Michealw Brown. Emails had extra spaces and even typos like gnail.com instead of gmail.com, along with many other inaccuracies. Sarah realized the hard job ahead of her: data cleaning.
Inconsistent formatting, missing values, and duplicates would lead to skewed results, giving an inaccurate picture of GreenEnv's customer base. Days turned into weeks as Sarah tirelessly cleaned the data, fixing inconsistencies, filling in gaps, and eliminating duplicates. It was a tedious process, but essential to ensure her analysis was built on a solid foundation.
Who cares about data quality?
Every year, poor data quality costs organizations an average of $12.9 million. [1]
Thankfully, after weeks of cleaning and organizing this messy data, Sarah was able to get the job done, or at least this part of it.
Her next challenge came when she ventured into product data, aiming to identify top-selling items and recommend future opportunities. However, she encountered a different problem: a complete lack of metadata. Product descriptions were absent, and categories were ambiguous. Basically, there wasn't enough data to help Sarah understand the product data. Sarah realized the importance of metadata management, structured information about the data itself. Without it, understanding and analyzing the data was almost impossible.
Research Shows Most Data Has Inaccuracies
Research by Experian reveals that businesses believe around 29% of their data is inaccurate in some way. [2]
Frustrated but determined, Sarah reached out to different departments to piece together information about the products. She discovered that each department used its own internal jargon and classification systems. Marketing and sales referred to the same cleaning product by different names.
As Sarah delved deeper, she found that datasets were kept in separate applications by different departments and that outdated storage systems were struggling to handle the growing volume of data; Sarah had to wait a long time for her queries to execute. She also noticed there were no clear rules on who can access what data and under what terms. Without centralized control and proper access controls, the risk of unauthorized access to sensitive information increases, potentially leading to data breaches and compliance violations. The lack of data governance, a set of rules and procedures for managing data, was evident.
Data Breaches Can Be Costly
According to the Ponemon Institute, the average cost of a data breach in 2023 is $4.45 million globally, an all-time high record, with costs varying by industry and location. [3]
Each of the above issues and hurdles in Sarah's story highlighted the interconnectedness of many pillars: data quality, metadata management, and data governance all played a crucial role in accessing and utilizing valuable insights at GreenEnv.
Sarah's journey is a common one for data scientists and analysts. Many organizations have massive amounts of data, and everyone knows the saying: data is the new electricity. Every organization wants to make the most of its data, as it's a very valuable asset. But most people mistakenly (and practically) believe that simply hiring a data analyst or data scientist is enough to unlock this value. There are many pillars to getting the most value from data, and organizations need to account for and pay attention to these. The keyword here is data management.
Did you know..
86% of organizations say they believe investing in data management directly impacts their business growth. [4]
Read the original:
The Many Pillars of Getting the Most Value From Your Organization's Data - Towards Data Science
8 Things Most Data Science Programs Don’t Teach (But You Should Know) Part 2 – Towards Data Science
MIT calls this "the missing semester of your CS education"
What data science and software engineering have in common is writing code. But while code is the main outcome of software engineering, data science projects typically end with models, results, and reports. Consequently, in data science the quality, structure, and delivery of code are often an afterthought at best.
The implicit expectation with data science projects is that the results reported at the end can be trusted.
This means that if someone asked you to re-run your or somebody else's analysis, you would be able to obtain the same results, regardless of how much time has passed since you first performed the analysis.
Similarly, if you are developing a component for a product, the implicit expectation is that the component you developed represents the best possible performance given what is reasonably possible within the requirements of the product.
These statements may seem obvious, but satisfying both expectations can be quite difficult.
If you don't believe me, think about your past projects.
Continued here:
8 Things Most Data Science Programs Don't Teach (But You Should Know) Part 2 - Towards Data Science
Advancing drug discovery with AI: introducing the KEDD framework – EurekAlert
Image: A simple but effective feature fusion framework that jointly incorporates biomolecular structures, knowledge graphs, and biomedical texts for AI drug discovery. Credit: Yizhen Luo, Institute for AI Industry Research (AIR), Tsinghua University
A transformative study published in Health Data Science, a Science Partner Journal, introduces a groundbreaking end-to-end deep learning framework, known as Knowledge-Empowered Drug Discovery (KEDD), aimed at revolutionizing the field of drug discovery. This innovative framework adeptly integrates structured and unstructured knowledge, enhancing the AI-driven exploration of molecular dynamics and interactions.
Traditionally, AI applications in drug discovery have been constrained by their focus on singular tasks, neglecting the rich tapestry of structured and unstructured data that could enrich their predictive accuracy. These limitations are particularly pronounced when dealing with novel compounds or proteins, where existing knowledge is scant or absent, often hampered by the prohibitive costs of manual data annotation.
Professor Zaiqing Nie, from Tsinghua University's Institute for AI Industry Research, emphasizes the enhancement potential of AI in drug discovery through KEDD. This framework synergizes data from molecular structures, knowledge graphs, and biomedical literature, offering a comprehensive approach that transcends the limitations of conventional models.
At its core, KEDD employs robust representation learning models to distill dense features from various data modalities. Following this, it integrates these features through a fusion process and leverages a predictive network to ascertain outcomes, facilitating its application across a spectrum of AI-facilitated drug discovery endeavors.
The study substantiates KEDD's effectiveness, showcasing its ability to outperform existing AI models in critical drug discovery tasks. Notably, KEDD demonstrates resilience in the face of the 'missing modality problem,' where lack of documented data on new drugs or proteins could undermine analytical processes. This resilience stems from its innovative use of sparse attention and modality masking techniques, which harness the power of existing knowledge bases to inform predictions and analyses.
Looking forward, Yizhen Luo, a key contributor to the KEDD project, outlines ambitious plans to enhance the framework's capabilities, including the exploration of multimodal pre-training strategies. The overarching objective is to cultivate a versatile, knowledge-driven AI ecosystem that accelerates biomedical research, delivering timely insights and recommendations to advance therapeutic discovery and development.
Health Data Science
Toward Unified AI Drug Discovery with Multimodal Knowledge
Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.
Read the original here:
Advancing drug discovery with AI: introducing the KEDD framework - EurekAlert