Category Archives: Data Science
AI readiness requires buy-in, technology and good governance – TechTarget
Data management and analytics are now firmly in a new era, with AI by far the main focal point of users' interests and vendors' product development. But before organizations can make use of those cutting-edge capabilities, they must be ready for AI.
In an earlier era, the rise of self-service analytics required enterprises to modernize data infrastructures and develop data governance frameworks that balance limits on data access, based on an employee's role, with confident exploration and analysis.
Now, similarly, the era of AI requires organizations to modernize, according to Fern Halper, vice president of research at research and advisory firm TDWI. As a result, top priorities for organizations are supporting sophisticated analytics and making sure data is prepared and available for AI models and applications, according to TDWI's research.
"Organizations are trying to get ready for AI because many of them are viewing it as an imperative for something like digital transformation, competitiveness, operational efficiency and other business drivers," Halper said on July 10 during a virtual conference hosted by TDWI.
Ensuring readiness for developing and deploying AI models and applications is a process, she continued. Included in that process are proper data preparation; operational readiness, including sophisticated data platforms; and appropriate AI governance.
While technology and governance are critical aspects of AI readiness, the process of preparing for AI development and deployment begins with organizational buy-in. Those who want to use AI to surface insights and inform decisions need to get support from the executive suite that trickles down throughout the rest of the organization.
The new era of AI in data management and analytics started in November 2022 when OpenAI released ChatGPT, marking a significant improvement in generative AI capabilities.
Enterprises have long wanted to make analytics use more widespread, given that data-driven decisions spur growth at a higher rate than decisions made without data. However, due to the complexity of analytics and data management platforms, which require coding to carry out most tasks and data literacy training to interpret outputs, analytics use has stagnated for around two decades. Only about a quarter of employees within organizations regularly use data in their workflows.
Generative AI has the potential to change that by enabling the true natural language processing that tools developed by analytics and data management vendors never could. In addition, generative AI tools can be programmed to automate repetitive tasks, which eases burdens placed on data engineers and other data experts.
As a result, many vendors have made generative AI a focus of their product development, building tools such as AI assistants that can be used in concert with an enterprise's data to enable natural language queries and analysis. Simultaneously, many enterprises have made generative AI a focus of their own development, building models and applications that can be used to generate insights and automate tasks.
Still, getting executives to recognize the importance of generative AI sometimes takes effort, according to Halper.
"None of this works if an organization isn't committed to it," she said.
Commitment is an ongoing process barely two years into this new era, Halper continued, noting that a TDWI survey showed that only 10% of respondents have a defined AI strategy in place and another 20% are in the process of implementing an AI strategy. In addition, less than half of all respondents report that their leadership is committed to investing in the necessary resources, including the people required to work with the requisite tools, such as data operations staff.
Getting executive support takes identifying existing problems that can be solved with AI capabilities and showing the potential results, such as cost savings or increased growth.
"Your organization is going to need to be made aware of what's needed for AI," she said. "It's really best to understand the business problems you're trying to solve with AI so that you can frame [the need for AI] in a way the business leaders understand. Then you can show how you'll measure value from AI. This can take some doing, but it's necessary to engage the business stakeholders."
Assuming there's organizational support, AI readiness begins with the data at the foundation of any model or application.
Models and applications trained with high quality data will deliver high quality outcomes. Models and applications trained with low-quality data will deliver low-quality outcomes. In addition, the more quality data that can be harnessed to train an AI model or application, the more accurate it will be.
As a result, structured data such as financial and transaction records that has historically informed analytics reports and dashboards is required. In addition, unstructured data such as text and images often left unused is important.
Accessing unstructured data in addition to structured data and transforming that unstructured data to make it discoverable and usable takes a modern data platform. So does combining that data with a large language model, such as ChatGPT or Google Gemini, to apply generative AI.
A 20-year-old data warehouse doesn't have the necessary technology, which includes the compute power, to handle AI workloads. Neither does an on-premises database.
"Organizations are concerned about futureproofing their environment to handle the needs of increased data availability and workload speed and power and scalability for AI," Halper said.
Cloud data warehouses, data lakes and data lakehouses are able to handle the data volume required to inform AI models and applications. Toward that end, spending on cloud-based deployments is increasing while spending on on-premises deployments is dropping.
But that's just a start. The trustworthy data required for AI readiness remains a problem with less than half of those surveyed by TDWI reporting they have a trusted data foundation in place.
Automation can help, according to Halper. By using data management and analytics tools that themselves use AI to automate data preparation, organizations can improve data quality and the trustworthiness of insights.
Data ingestion, integration, pipeline development and curation are complex and labor intensive. Tools that automate those processes improve efficiency given that machines are much faster than humans. They also improve accuracy. No person or team of people can examine every data point among potentially millions for accuracy, whereas machines can be programmed to do so.
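As a rough illustration of the kind of check a machine can run exhaustively where a human team cannot, the sketch below scans a batch of records for missing or out-of-range values. The field names ("customer_id", "amount") and rules are hypothetical, not from TDWI's research; real data-quality tools apply far richer rule sets and learned checks.

```python
# Hypothetical data-quality check: field names and rules are illustrative,
# not taken from the article.

def check_records(records):
    """Flag records with missing or out-of-range values."""
    problems = []
    for i, rec in enumerate(records):
        if rec.get("amount") is None:
            problems.append((i, "missing amount"))
        elif rec["amount"] < 0:
            problems.append((i, "negative amount"))
        if not rec.get("customer_id"):
            problems.append((i, "missing customer_id"))
    return problems

records = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "", "amount": 35.5},
    {"customer_id": "C3", "amount": None},
    {"customer_id": "C4", "amount": -10.0},
]
issues = check_records(records)
print(issues)  # every flagged (index, reason) pair
```

Because the loop visits every record, the same logic scales to millions of data points, which is exactly the exhaustiveness a human reviewer cannot provide.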
"Automation can play a key role in data mapping for accuracy, handling jobs and automating workflows," Halper said. "Where we're seeing most is automation and augmentation for data classification and data quality."
For example, AI-powered tools such as data observability platforms are used to scan data pipelines to identify problem areas.
"Using these intelligent tools is important," Halper said. "Organizations are realizing they need to look for tools that are going to help them with [data readiness for AI]. There are tools organizations can make use of as they continue to scale their amount of data."
Data quality and proper technology -- in concert with organizational support -- are still not enough on their own to guarantee an enterprise's readiness for developing and deploying AI models and applications.
To protect organizations from potentially exposing sensitive information, violating regulations or simply taking actions without proper due diligence, guidelines must be in place to limit who can access certain AI models and applications as well as how they can be used.
When self-service analytics platforms were first developed, enabling business users to work with data in addition to the IT teams that historically oversaw all data management and analysis, organizations needed to develop data governance frameworks.
Those data governance frameworks, when done right, simultaneously enable confident self-service analysis and decision-making while protecting the enterprise from harm. As the use of AI models and applications -- specifically generative AI applications that enable more people to engage with data -- becomes more widespread within the enterprise, similar governance frameworks need to be in place for their use.
"For AI to succeed, it's going to require governance," Halper said.
AI requires new types of data, such as text and images. In addition, AI requires the use of various platforms, including data warehouses and lakes, vector databases that enable unstructured data discovery, and retrieval-augmented generation pipelines to train models and applications with relevant data.
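To make the retrieval side of a retrieval-augmented generation pipeline concrete, here is a toy sketch of the step a vector database performs: documents are ranked by cosine similarity between their vectors and a query vector. The document names and 3-D vectors are purely illustrative; production systems use learned embeddings of much higher dimension.

```python
import math

# Toy retrieval step of a RAG pipeline: rank documents by cosine similarity
# to a query vector. Names and 3-D vectors are illustrative only.
docs = {
    "q3 revenue report": [0.9, 0.1, 0.1],
    "employee handbook": [0.1, 0.8, 0.3],
    "sales forecast":    [0.6, 0.4, 0.2],
}
query = [0.85, 0.15, 0.15]  # stand-in for an embedded user question

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Most similar documents first; the top hits would be passed to the LLM
# as context for generating a grounded answer.
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])
```

In practice this similarity search runs inside the vector database itself, over millions of embeddings rather than three hand-written ones.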
Governance, therefore, encompasses diverse data and multiple environments to address AI readiness, according to Halper. Governance also must include oversight of the various types of AI, including generative AI, to determine whether outputs are toxic or incorrect as well as whether there's bias in a model or application.
"The future starts now, and there's a lot to think about," Halper said. "Data management is going to continue to be a journey in terms of managing new data for AI and beyond. Organizations really need to think forward strategically and not be caught off-guard."
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
Using AI to map research in the School of Arts & Sciences – Penn Today
When Colin Twomey became interim executive director of the Data Driven Discovery Initiative (DDDI) last summer, he says, his background in behavioral ecology meant that he had a good idea of the data science needs for his own field and some idea for biology, genetics, and evolution. However, with DDDI serving as the hub for data science education and research across the School of Arts & Sciences, Twomey says he found his understanding of the needs for chemistry, sociology, and other fields to be lacking.
To tackle the problem, he followed his instinct as an ecologist: map out the system and get a big-picture view before digging into the details. What resulted is a work-in-progress map intended to capture all published research by current faculty in SAS, including their work before coming to Penn, encompassing research that spans several decades. It uses the same technology as ChatGPT and similar large language models (LLMs).
"I really think of it as like a Google Maps for research. It gives you a very fast way to get oriented to a really big and complex research environment like Penn," Twomey says. He built what he calls the University Atlas Project, or uAtlas for short, during his personal time, and it's just one of the ways Penn is leading in data-driven research, teaching, and applications.
At first glance, it might look like a single-cell atlas to a scientist or an abstract design to an artist. While the map is still being worked on, each of the more than 40,000 dots is a different publication by a professor, color-coded by department, and zooming in shows labels for 240 topics. Departments are assigned a specific color. Red is economics. Highlighter-orange is chemistry. Pastel yellow is psychology. Robin's-egg blue is Africana studies. Hot pink is cinema and media studies, and so forth.
The spatial arrangement shows how thematically similar each paper is in relation to the others and illustrates the interdisciplinary pursuits of Penn faculty. "There's all sorts of really unexpected overlaps, and it also doesn't put anyone into a box," Twomey says.
"The Department of Physics and Astronomy shows up as very broad," Twomey says. "It has its tendrils into everything, which is kind of amazing; it really does accommodate a very broad range of interests, from social sciences and psychology to chemistry."
The multicolored pattern of dots around labels such as "inequality," "bioethical dilemmas," and "COVID-19 impact" shows how researchers in psychology, sociology, political science, philosophy, economics, Africana studies, and more are leading on the great challenges of our time.
The map is also searchable by name, which shows the varied interests and cross-disciplinary work of Penn faculty. For example, the spread-out clusters for physics professor Vijay Balasubramanian reflect his interests in string theory and neuroscience.
Users can also adjust the view to show only works published before or after a selected year. Twomey was struck by a bridge of green dots, for earth and environmental science, connecting hard science subjects (specifically the topic of past climate variability) to the social sciences. The bridge labeled "climate communication," Twomey says, didn't start appearing until after about 2004, pointing to research led by Michael Mann.
Twomey says the tool has been useful to him in identifying what is going on in different departments. And he says it can also help faculty identify potential collaborators and help prospective graduate students and postdocs determine with whom they want to work. "My other hope for this is that, once you do this for long enough, you get these pictures of where the University is evolving over time, where research has moved," Twomey says.
Bhuvnesh Jain, the Walter H. and Leonore C. Annenberg Professor in the Natural Sciences and faculty co-director of DDDI, says he loves that Twomey's map is both sophisticated, in its use of an LLM to embed research papers onto an abstract space, and visually intuitive.
"The map transcends discipline and sub-discipline labels and shows how closely connected a lot of our work is," Jain says, adding that he had fun brainstorming with Twomey on the applications of this tool. "I am confident that the users will range from incoming Penn undergraduates to the deans of our schools, who will be able to rapidly visualize the hubs of activity, the interconnections of different research efforts, and the growth areas in different fields."
To build this map, Twomey began by figuring out the affiliations of SAS faculty, which he says was a challenge because the data live in many places across the University. He then used Python to distill the data and a large language model to map the semantic content of each publication into a high-dimensional embedding space. But Twomey says visualizing hundreds of dimensions simultaneously is impractical, so the final map compresses data into a two-dimensional representation that best preserves the relationships between papers that address similar topics.
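As a hedged sketch of the dimensionality-reduction step described above (not Twomey's actual code, which may well use a technique such as UMAP or t-SNE rather than the PCA shown here), the following compresses toy high-dimensional "paper embeddings" into two dimensions:

```python
import numpy as np

# Illustrative only: reduce high-dimensional "paper embeddings" to 2-D
# with PCA so thematically similar papers land near each other on a map.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 50))  # 100 toy papers, 50-dim embeddings

# Center the data, then take the top two principal axes via SVD.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T  # each row is one paper's (x, y) position

print(coords_2d.shape)
```

PCA preserves only linear structure; nonlinear methods like UMAP typically keep local neighborhoods of similar papers better, which is why they are common for maps like this one.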
He next used the programming language Elixir to build a custom web server so the map would appear on a user-friendly website. Twomey then used an LLM again to add the research topics, choosing a labeling system that he felt was neither too dense nor too sparse, so it's "not overwhelming but still gives you enough waypoints."
To date, the map captures most but not all School of Arts & Sciences faculty as Twomey continues to work on the project. He also notes that some data from indexes like Google Scholar and OpenAlex may be incorrect, meaning a professor may show up as incorrectly attached to a paper or the year is wrong, so additional validation is needed. Twomey's goal is to eventually include research from graduate students and postdocs as well and to expand beyond SAS.
"The School of Arts and Sciences has 28 departments and 34 centers, and seeing how all those intersect is super fascinating, but that's just one piece, one school," Twomey says. "I want to have this Penn-wide and even scale it beyond Penn in the future."
Brain Data Science Platform increases EEG accessibility with open data and research enabled by AWS – AWS Blog
Introduction
About 4.5 million electroencephalogram (EEG) tests are performed in the US each year. That's more than if every person in Oregon, Connecticut, or Iowa got an EEG. Compared to magnetic resonance imaging (MRI) scans, which use magnetic fields and radio waves to generate images of the structure of the brain, EEGs use wires placed on the scalp to record brain function through the electrical activity generated as neurons send signals to each other. The Brain Data Science Platform (BDSP), hosted on Amazon Web Services (AWS), is increasing EEG accessibility through cooperative data sharing and research enabled by the cloud.
Because they provide insights into brain activity and not just structure, EEGs are one of the most common tests ordered by doctors to help make a diagnosis for people with brain problems. This includes seizures and epilepsy, coma, stroke, developmental delays in children, and sleep disorders. However, in current practice, EEGs are not always part of a diagnostic plan, even when they could provide important information. Currently, experts trained in EEG are in short supply, and methods to automatically interpret EEGs are not yet advanced enough to fill this gap. For these reasons, diagnostic plans that include EEGs are limited. The cloud increases EEG accessibility by facilitating data sharing and research innovation, making EEGs available for more patients' medical care plans.
Brain cells talk to each other using tiny electrical impulses and are constantly active, even during sleep. Activity is measured by placing small metal discs (electrodes), held in place by tape or glue, at different locations on the scalp. These electrodes detect the tiny voltage fluctuations from the activity of millions of neurons in the brain. The electrodes are connected to an amplifier, which magnifies the weak electrical signals picked up by the electrodes. The acquired signals are often weak and susceptible to various types of interference, including noise from the environment or other non-brain biological sources. The amplifier performs signal conditioning operations to filter out unwanted noise and artifacts while preserving the relevant brain activity. This can involve processes such as amplification, filtering, and isolation.
Once conditioned, signals are converted from analog to digital format using an analog-to-digital converter (ADC). The ADC samples the analog signals at a specific rate and converts them into a digital representation that can be processed and analyzed by a computer. The digitized EEG signals can undergo further processing, like additional filtering, artifact removal, and feature extraction. Various algorithms and techniques can be applied to extract meaningful information from the EEG signals, depending on the specific analysis or diagnostic purpose. For example, algorithms can detect EEG activity that is normal for a given age, as shown in Figure 1.
Figure 1. A 15-second excerpt from a normal EEG recording from a 42-year-old woman who is awake. The vertical bars are 1 second apart. The channel names indicate locations on the persons head.
Algorithms can also detect harmful brain activity like a seizure, as shown in Figure 2.
Figure 2. A 15-second excerpt from an EEG recording from a 22-year-old man. The high-amplitude rhythmic brain activity is a seizure. Detecting this kind of abnormal electrical activity in the brain helps establish a diagnosis of epilepsy and helps doctors choose a treatment that can prevent future seizures.
The processed EEG signals are stored for later review or analyzed in real time. The data is visualized as waveforms or through spectral analysis, event-related potential analysis, or analysis by machine learning (ML) algorithms to detect abnormalities or patterns of interest.
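A minimal, non-clinical illustration of the digital filtering step described above: the sketch below samples a synthetic signal containing a 10 Hz "brain rhythm" plus 60 Hz "line noise" and applies a simple moving-average low-pass filter. The sampling rate and frequencies are illustrative; real EEG systems use dedicated hardware filters and far more sophisticated processing.

```python
import math

# Synthetic signal: a 10 Hz "brain rhythm" plus 60 Hz "line noise,"
# sampled at 256 Hz. All numbers are illustrative.
fs = 256  # samples per second
signal = [
    math.sin(2 * math.pi * 10 * i / fs) + 0.5 * math.sin(2 * math.pi * 60 * i / fs)
    for i in range(fs)  # one second of data
]

def moving_average(xs, window=9):
    """A crude low-pass filter: average each sample with its neighbors."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - half), min(len(xs), i + half + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

filtered = moving_average(signal)
# The 9-sample window spans about two full 60 Hz cycles, so the line noise
# largely cancels while the slower 10 Hz rhythm survives.
```

The same idea, with properly designed digital filters, underlies the artifact removal and feature extraction stages that feed EEG data into downstream ML algorithms.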
In the US, about 75 percent of EEGs are interpreted by neurologists without expertise in EEG interpretation. This can lead to mistakes in tricky cases, such as when an EEG looks abnormal but really is not, so some patients who really have a heart problem get misdiagnosed instead with epilepsy. In many parts of the world, patients are unable to get an EEG because the doctors available are not trained to interpret EEGs. People who have sleep problems face similar challenges in getting a diagnosis. Doctors can order a sleep test (which includes recording EEG and other signals overnight while sleeping) to help diagnose sleep problems. However, getting the sleep test done can take a long time or may not be possible because there is a shortage of sleep specialists.
A research team at Harvard Medical School, headed by Drs. Brandon Westover and Robert Thomas at Beth Israel Deaconess Medical Center, and Drs. Valdery Moura Junior and Sahar Zafar at Massachusetts General Hospital, aims to make EEGs more easily accessible by using artificial intelligence (AI) to automate medical diagnosis based on EEG. They are joined in this effort by other scientists from several institutions.[i]
The team is working to automate EEG and sleep testing interpretation by developing the Brain Data Science Platform (BDSP), the world's largest and most diverse set of EEG and sleep testing data. Using this data, the team is constructing algorithms that diagnose sleep disorders, detect seizures and other forms of harmful brain activity in hospitalized patients who are critically ill, predict the risk of future seizures, and calculate the probability that a patient with coma due to brain damage will be able to recover consciousness. To be useful in the real world, these algorithms need to cope with EEG patterns from all people, regardless of age, gender, race, and ethnicity, and across a vast number of different health conditions. Thus, the algorithms that underlie automated EEG and sleep test interpretation must be well-trained and well-tested so that the resulting diagnoses are as reliable as, or more reliable than, those currently obtained when EEGs are interpreted by human experts with specialty medical training in EEG and sleep test interpretation.
Beyond the diagnostic information currently available from EEGs, the team believes that there is hidden information, especially during sleep, that reveals insights into the health of the brain and which even experts cannot see. The understanding is that each of the different stages of sleep, rapid-eye movement (REM) and the light and deep stages of non-REM sleep, has certain patterns that are normal for a given age and gender. Divergence from these norms can indicate positive or negative deviations from normal health.
The team is developing AI algorithms that use sleep signals to detect early signs of diseases like Parkinson's disease and Alzheimer's disease. Earlier detection allows treatments to be given sooner, when they can be more effective. The team has already found that information from sleep can predict life expectancy. Finally, one member of the research team, Dr. Haoqi Sun, has developed a way to measure brain age, as distinct from chronologic age. This validates the concept that someone who is 80 can have a brain that functions like someone 20 years younger. Accelerated brain aging (brain age older than chronologic age) is linked to a variety of brain health problems, including declining cognitive functioning and diseases like Alzheimer's. The team believes that the ability to measure brain age, and similar types of hidden health information in sleep, may enable doctors to treat diseases more efficiently and effectively while providing more direct ways to measure the effects of those treatments on brain health.
The team has assembled a massive collection of EEG and sleep data: currently more than 200,000 EEG recordings and 26,000-plus sleep tests. These span all medical settings where EEGs and sleep tests are performed, including outpatient neurology clinics, epilepsy centers, sleep centers, and home settings where data is collected using wearable consumer devices. To make the research more valuable, additional metadata is being collected as well. This includes diagnoses, medications, laboratory testing results, and brain imaging, including head computed tomography (CT) scans and brain MRI images.
Dr. Westover intends for any clinic or hospital in the world to benefit from the models that BDSP researchers develop. The benefits would not be limited to clinical groups with access to powerful on-premises servers and clusters.
"We intend to offer the machine learning EEG interpretation models as an online service, where sites can upload their EEG data and get their results back within seconds," says Dr. Westover. "If we want to engage clinical sites without such a level of internal IT infrastructure and bandwidth, we need to offer them a simple way to access our developments. This is where the cloud is going to be crucial."
Dr. Westover is collaborating with AWS to increase access to brain data and support researchers focusing on brain health. Twelve other hospitals have already committed to adding their data to BDSP on AWS, further increasing the possibilities for new discoveries by making it possible to study rare diseases, which are typically not seen often enough at any single hospital to allow rigorous research. Dr. Westover believes BDSP will transform the field of brain health, paving the way for more personalized and precise ways to diagnose, treat, and prevent neurological disease.
"These datasets are going to let us launch the field of precision brain health," said Dr. Westover.
With support from the AWS Open Data Sponsorship Program, the BDSP datasets are now openly available at no cost to researchers around the world. The BDSP dataset is one of the largest collections of brain data in existence.
[i] Collaborators include Junior Moura, PhD, Umakanth Katwa, MD, Wolfgang Ganglberger, PhD, Thijs Nassi, MSc, Erik-Jan Meulenbrugge, MSc, Yalda Amidi, PhD, Jin Jing, PhD, Haoqi Sun, PhD, Mouhsin Shafi, MD, PhD, Daniel Goldenholz, MD, PhD, Arjun Singh, MD, Sahar Zafar, MD, Shibani Mukerji, MD, PhD, Jurriaan Peters, MD, and Tobias Loddenkemper, MD at Harvard Medical School; Aaron Struck, MD at University of Wisconsin; Jennifer Kim, MD, PhD at Yale University; Emmanuel Mignot, MD, PhD and Chris Lee-Messer, MD, PhD at Stanford University; Gari Clifford, PhD, Samaneh Nasiri, PhD, and Lynne Marie Trotti, MD at Emory University; and Dennis Hwang, MD at Kaiser Permanente.
Synergy of Generative, Analytical, Causal, and Autonomous AI – Data Science Central
The current fascination with Generative AI (GenAI), especially as manifested by OpenAI's ChatGPT, has raised public awareness of Artificial Intelligence (AI) and its ability to create new sources of customer, product, service, and operational value. Leveraging GenAI tools and Large Language Models (LLMs) to generate new textual, graphical, video, and audio content is astounding.
However, let's not forget about the predictive, understandable, and continuously learning legs of AI: analytical AI, which focuses on pattern recognition and prediction; causal AI, which seeks to identify and understand cause-and-effect relationships; and autonomous AI, which aims to operate independently and make real-time decisions. In the ever-evolving landscape of artificial intelligence (AI), four distinct but equally transformative branches have emerged: Generative AI, Analytical AI, Causal AI, and Autonomous AI.
As organizations strive to harness the power of data to drive decision-making and innovation, understanding the differences, similarities, and collaborative potential between these types of AI is crucial. This blog explores these facets, highlighting how combining Generative, Analytical, Causal, and Autonomous AI can unlock unprecedented economic value and create new opportunities for customer, product, service, and operational advancements (Figure 1).
Figure 1: Analytics (AI) Business Model Maturity Index
As always, let's start by establishing some definitions:
As I wrote in an earlier blog titled "Generative AI: Precursor to Autonomous Analytics," Generative AI is a foundational technology leading toward developing Autonomous AI. Generative AI, with its ability to create new data and content based on existing patterns, paves the way for more sophisticated autonomous systems. These systems leverage the generative capabilities to enhance their decision-making processes, operate independently, and adapt to dynamic environments. This progression underscores the importance of understanding the interplay between these AI types to fully harness their combined potential in driving innovation and efficiency across various sectors.
Let's create a quick matrix that compares critical aspects of these four different classifications of AI (Table 1).
Table 1: Four Types of Artificial Intelligence (AI)
The synergy of Generative AI, Analytical AI, Causal AI, and Autonomous AI can profoundly impact every industry. Here are just a few examples (Figure 2):
Figure 2: Industry Use Cases: Synergizing Generative, Analytical, Causal, and Autonomous AI
These use cases demonstrate how integrating Generative AI, Analytical AI, Causal AI, and Autonomous AI can drive innovation, efficiency, and effectiveness across various industries, leveraging the strengths of each AI type to create significant value.
To fully realize the benefits of AI technologies, organizations must understand and capitalize on the distinct capabilities of Generative AI, Analytical AI, Causal AI, and Autonomous AI. By synergizing across these different types of AI, organizations can drive innovation, elevate decision-making processes, and optimize operational efficiency. The collective potential of these AI technologies emphasizes the transformative influence of AI in developing advanced, adaptable, and streamlined systems.
Building a Data Science Platform with Kubernetes | by Avinash Kanumuru | Jul, 2024 – Towards Data Science
Photo by Growtika on Unsplash
When I started in my new role as Manager of Data Science, little did I know about setting up a data science platform for the team. In all my previous roles, I had worked on building models and to some extent deploying models (or at least supporting the team that was deploying models), but I never needed to set up something from scratch (infra, I mean). The data science team did not exist then.
So my first objective was to set up a platform, not just for the data science team in a silo, but one that could be integrated with the data engineering and software teams. This is when I was introduced to Kubernetes (k8s) directly. I had heard of it earlier but hadn't worked beyond creating Docker images that someone else would deploy on some infrastructure.
Now, why is Kubernetes required for the data science team? What are some of the challenges faced by data science teams?
10 Data Analyst Interview Questions to Land a Job in 2024 – KDnuggets
As an entry-level data analyst candidate, the job hunt can feel like a never-ending process.
I went through countless data analyst interviews at the beginning of my career and was often left feeling lost and confused.
There were often edge cases, business problems, and tricky technical questions I struggled with, and after each interview round, I'd feel my confidence falter.
After spending 4 years in the industry and helping conduct entry-level interviews, however, I've learned more about what employers are looking for in data analyst candidates.
There are typically three areas of focus that we'll dive into in this article: technical expertise, business problem-solving, and soft skills.
Every interview round will cover some aspect of these broader areas, although each employer places a higher emphasis on different sets of skills.
For example, management consulting firms are big on presentation skills. They want to know if you can present complex technical insights to business stakeholders.
In this case, your soft skills and ability to problem-solve are prioritized over technical skills. They don't care as much about your clean Python code as they do about your ability to explain the results of a hypothesis test to a stakeholder.
In contrast, product-based companies or tech startups tend to prioritize technical skills. They often test your ability to code, perform ETL tasks, and handle deliverables in a timely manner.
But I digress.
You came here to learn about how to get a job as a data analyst, so let's dive straight into the questions you are likely to encounter during the interview process.
Typically, the first round of an entry-level data analyst interview comprises a list of technical questions. This is either a timed technical test or a take-home assessment, the results of which will be used to determine if you progress to the next level. Here are some questions you can expect during this interview round, with examples of how they can be answered:
Sample answer:
Hypothesis testing is a technique used to make inferences about population parameters based on a sample dataset.
It starts by formulating a null hypothesis (H0), which represents the default assumption that there is no effect. A significance level is then chosen, typically 0.05 or 0.10; this is the probability threshold below which the null hypothesis will be rejected.
Statistical tests, such as the t-test, ANOVA, or the chi-squared test, will then be applied to test the initial hypothesis using data from the sample population.
A test statistic is then computed, along with a p-value, which is the probability of observing the test result under the null hypothesis.
If the p-value falls below the significance level, then the null hypothesis can be rejected, and there is enough evidence to support the alternative hypothesis.
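The procedure above can be sketched end to end in plain Python. The sketch below swaps in a large-sample one-sample z-test (a simplification of the t-test mentioned above, using only the standard library), and the sample values are made up for illustration:

```python
import math
import statistics

def one_sample_z_test(sample, mu0, alpha=0.05):
    """Test H0: population mean == mu0 against a two-sided alternative.
    Uses the sample standard deviation as a stand-in for sigma
    (a large-sample approximation)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    z = (mean - mu0) / se                         # test statistic
    p = math.erfc(abs(z) / math.sqrt(2))          # two-sided p-value
    return z, p, p < alpha                        # reject H0 when p < alpha

# Made-up measurements; H0: the true mean is 4.5
z, p, reject = one_sample_z_test(
    [5.1, 4.9, 5.3, 5.6, 4.8, 5.2, 5.4, 5.0, 5.5, 4.7], mu0=4.5)
```

Here the p-value falls well below 0.05, so the null hypothesis is rejected; with a real dataset you would normally reach for a library routine such as `scipy.stats.ttest_1samp` instead.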
Sample answer:
The t-test and the chi-squared test are statistical techniques used to compare groups of data, but they apply in different scenarios: the t-test compares the means of a continuous variable between groups, while the chi-squared test checks for association between categorical variables.
Here are situations in which I'd use each test:
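As one illustration, a chi-squared test of independence on a 2x2 contingency table can be sketched in plain Python; the click counts below are invented:

```python
def chi_squared_2x2(table):
    """Chi-squared test of independence for a 2x2 contingency table
    of observed counts, compared against the df=1, alpha=0.05
    critical value of 3.841."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under H0
            stat += (obs - expected) ** 2 / expected
    return stat, stat > 3.841

# Hypothetical counts: clicks vs. no-clicks for two ad variants
stat, reject = chi_squared_2x2([[30, 70], [55, 45]])
```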
Sample answer: There are various ways to handle missing data in a dataset depending on the problem statement and the variable's distribution. Some common approaches include:
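One of the most common of those approaches, mean imputation for a numeric variable, might look like this in plain Python (pandas users would typically reach for `fillna` instead):

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

# Hypothetical column with two missing entries
ages = [25, None, 31, 40, None, 28]
filled = impute_mean(ages)  # the Nones become 31.0, the mean of the rest
```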
Sample answer: To detect outliers, I would visualize the variables using a box plot to identify the points outside the chart's whiskers.
I would also calculate the Z-score for each variable and flag data points with a Z-score above +3 or below -3, as these are typically outliers.
To reduce the impact of outliers, I would transform the dataset using a function like RobustScaler() in Scikit-Learn, which scales the data according to the quantile range.
I might also use a transformation like the log, square root, or Box-Cox transformation to normalize the variable's distribution.
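The Z-score part of this answer can be sketched with the standard library alone (the data below are invented; `RobustScaler` and the transformations mentioned above are not reproduced here):

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the points whose Z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs((v - mean) / sd) > threshold]

# Nineteen well-behaved points plus one injected outlier
data = [9, 10, 11, 10, 9, 11, 10, 12, 8, 10, 11, 9, 10,
        10, 11, 9, 12, 10, 8, 100]
outliers = z_score_outliers(data)  # flags only the 100
```

One caveat worth raising in an interview: an extreme point inflates the standard deviation itself, so with very small samples a genuine outlier can fail to cross the ±3 cutoff.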
Sample answer:
The WHERE clause is used to filter individual rows in a table and is applied before any grouping takes place.
In comparison, the HAVING clause is used to filter records after the table has been aggregated, and it can only be used in conjunction with the GROUP BY clause.
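The difference is easy to demonstrate with Python's built-in sqlite3 module; the sales table below is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120), ("north", 80), ("south", 300),
                  ("south", 40), ("east", 50)])

# WHERE filters rows BEFORE grouping (only amounts >= 60 are aggregated);
# HAVING filters groups AFTER aggregation (keep regions totalling > 100).
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= 60
    GROUP BY region
    HAVING SUM(amount) > 100
""").fetchall()
```

The east row is removed by WHERE before grouping ever happens, as is south's 40; the surviving groups (north: 200, south: 300) both clear the HAVING threshold.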
Sample answer:
An inner join returns only records that have matching values between tables. If there are no matching values in the dataset, the inner join will return 0 records.
If all the rows between Table 1 and Table 2 match, then the query will return the total number of records in Table 1, which is 100.
Therefore, the range of expected records from an inner join between these tables is anywhere between 0 to 100.
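That 0-to-100 range assumes the join key is unique in both tables; with duplicated keys, an inner join can return more rows than either table holds. A quick sqlite3 sketch with made-up ids shows the matching behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER)")
conn.execute("CREATE TABLE t2 (id INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(1, 6)])  # ids 1..5
conn.executemany("INSERT INTO t2 VALUES (?)", [(i,) for i in range(4, 9)])  # ids 4..8

# Only ids 4 and 5 exist in both tables, so the inner join returns 2 rows
n = conn.execute(
    "SELECT COUNT(*) FROM t1 INNER JOIN t2 ON t1.id = t2.id"
).fetchone()[0]
```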
Notice that the above questions are centered around data preprocessing and analysis, SQL, and statistics. In some cases, you might be given an ER diagram and some tables and be asked to write an SQL query on the spot. You might even be expected to do pair programming, where you're given a dataset and need to solve a problem together with the interviewer.
Here are a few resources that will help you ace the technical SQL interview:
1. How to learn SQL for data analysis in 2024
2. Learn SQL for data analytics in 4 hours
Let's say you've made it through the technical interview.
This means that you meet the technical requirements of the employer and are now one step closer to landing the job. But you aren't out of the woods just yet.
Most data analyst interviews comprise case-study-type questions, where you'll be given a dataset and asked to analyze it to solve a business problem.
Here is an example of a case-study-type question that you might encounter in a data analyst interview:
Business Case: We are launching a marketing campaign to increase product sales and brand awareness. The campaign will include a mix of in-store promotions and online ads. How will you evaluate its success?
Here is a sample answer to the question above, outlining each step that one might take when faced with the above scenario:
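For the quantitative part of such an answer, one hedged sketch: compare conversion rates between customers exposed to the campaign and a control group using a two-proportion z-test (all counts below are hypothetical):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """One-sided two-proportion z-test: did group B (exposed to the
    campaign) convert at a higher rate than group A (control)?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2)) / 2  # one-sided p-value
    return z, p_value

# Hypothetical results: 300/10,000 control conversions vs 380/10,000 exposed
z, p = two_proportion_z(300, 10_000, 380, 10_000)
```

Brand awareness would need separate metrics (survey lift, ad recall, branded search volume), so a full answer would pair a test like this with those.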
Similar to the technical interview, this might be an on-the-spot question, where you're presented with the problem statement and need to work out the steps to achieve a solution.
Or it could even be a take-home assessment that takes about a week to complete.
Either way, the best way to prepare for this round is to practice.
Here are some learning resources I'd recommend exploring to ace this round of your data analyst interview:
1. How to solve a data analytics case study problem
2. Data analyst case study interview
Many people aren't too concerned about the soft-skill round of their interview.
This is where candidates get confident that they're about to be made an offer, since they've made it through the most difficult interview rounds.
But don't get cocky just yet.
I've seen many promising prospects get rejected because they didn't have the right attitude or didn't match the company culture.
While this section of the interview cannot be quantified like the previous rounds and is mostly based on what impression you leave the interviewers with, it is often the qualifying factor that makes a company choose you over other candidates.
Here are some questions you might expect during this interview:
Sample answer:
In my previous role, I was asked to present complex concepts to the marketing team at my organization.
They wanted to understand how our new customer segmentation model worked and how it could be used to improve campaign performance.
I started by illustrating each concept with a visual aid. I also created personas for each customer segment, assigning names to each user group to make them more digestible to stakeholders.
The marketing team clearly understood the value behind the segmentation model and used it in a subsequent campaign, which led to a 15% improvement in sales.
Note: If you have no prior experience and this is the first data analyst position you are applying for, then you can provide an example of how you would approach this situation if faced with it in the future.
Sample answer: In my latest data analytics project, I analyzed the demand for various skills required in data-related jobs in my country. I collected data by scraping 5,000 listings on job platforms and preprocessed this data in Python. Then, I identified the prominent terms in these job listings, such as Python, SQL, and communication. Finally, I built a Tableau dashboard displaying the frequency at which each skill appeared in these job listings. I wrote an article explaining my findings from this project and uploaded my code to GitHub.
Sample answer:
I believe that the most important trait for a data analyst to have is curiosity.
In all my past projects, I've been driven by curiosity to learn more about the data I was presented with.
My first data analytics project, for example, was created solely due to curiosity. I wanted to understand whether female representation in Hollywood had improved over the years, and how the gender dynamic had changed over time. Upon collecting and exploring the data, I discovered that movies with female directors typically had lower ratings than those with male directors.
Instead of stopping at this surface-level analysis, I was curious to understand why this was the case.
I performed further analysis by collecting the genres of these movies to better understand the target audience, and realized that the female-directed movies in my dataset had lower ratings because they were concentrated in a genre that was rated more poorly overall.
It was correlation, not causation.
I believe that it takes a curious person to uncover these insights and dive deeper into observed trends instead of simply taking them at face value.
I recommend actually writing down your answers to some of these questions beforehand, just as you would in any other interview round.
Culture and personality fit is really important to hiring managers, since an individual who doesn't adhere to the team's way of operating can cause friction further down the line.
You must research the company's culture and overall direction, and learn about how this aligns with your overall goals.
For example, if the company's environment is fast-paced and everyone is working on cutting-edge technology, gauge whether this is a place you'd thrive in.
If you're someone who wants to keep up with industry trends, learn as much as possible, and move up the career ladder quickly, then this is the place for you.
Make sure to convey that message to your interviewer, who likely shares a similar ambition and passion for growth.
Similarly, if you're the kind of person who prefers a consulting environment because you enjoy client work and breaking down solutions for non-technical stakeholders, then find a company that aligns with your skills and get that message across.
In simple terms, play to your strengths, and make sure they are conveyed to the employer.
While this might sound too simplistic, it is a better approach than simply applying to every open position you see on Indeed and wondering why you're getting nowhere in the job hunt.
If you've managed to follow along this far, congratulations!
You now understand the 3 types of questions asked in data analyst interviews and have a strong grasp of what employers are looking for in entry-level candidates.
Here are some potential next steps you can take to improve your chances of landing a job in the field:
Projects are a great way for you to stand out amongst other candidates and start getting job offers. You can watch this video to learn more about how to create projects to land your first job in the field.
I also recommend building a portfolio website to showcase all your work in one place. This will improve your visibility and maximize your chances of getting a data analyst role.
If you don't know where to start, I have an entire video tutorial teaching you to build a portfolio website from scratch with ChatGPT.
Brush up on skills like statistics, data visualization, SQL, and programming. There are countless resources that go into these topics in greater detail, and my favorites include Luke Barousse's YouTube channel, W3Schools, and StatQuest.
Natassha Selvaraj is a self-taught data scientist with a passion for writing. Natassha writes on everything data science-related. You can connect with her on LinkedIn or check out her YouTube channel.
Read more:
10 Data Analyst Interview Questions to Land a Job in 2024 - KDnuggets
What's New in Computer Vision and Object Detection? | by TDS Editors | Jul, 2024 – Towards Data Science
Feeling inspired to write your first TDS post? We're always open to contributions from new authors.
Before we get into this week's selection of stellar articles, we'd like to take a moment to thank all our readers, authors, and members of our broader community for helping us reach a major milestone, as our follower count on Medium just reached
We couldn't be more thrilled and grateful to everyone who has supported us in making TDS the thriving, learning-focused publication it is. Here's to more growth and exploration in the future!
Back to our regular business, we've chosen three recent articles as our highlights this week, focused on cutting-edge tools and approaches from the ever-exciting fields of computer vision and object detection. As multimodal models grow their footprint and use cases like autonomous driving, healthcare, and agriculture go mainstream, it's never been more crucial for data and ML practitioners to stay up to speed with the latest developments. (If you're more interested in other topics at the moment, we've got you covered! Scroll down for a handful of carefully picked recommendations on neuroscience, music and AI, environmentally conscious ML workflows, and more.)
See more here:
AEGIS London reveals new Data Science and Analytics team to enhance underwriting – Reinsurance News
Lloyd's syndicate AEGIS London has established a new Data Science and Analytics team to enhance underwriting capabilities and data-driven initiatives, led by Giuseppe D'Angelo, Head of Data Analytics and Portfolio Underwriting.
The newly formed team reportedly comprises a group of Data Scientists, managed by AI and Data Science expert, Dan Hirlea, and Data Analysts, managed by data visualisation expert and qualified actuary, Balint Bone.
AEGIS London noted that these two bring a wealth of experience and knowledge to the firm, allowing them to extract valuable insights, identify trends, and provide data-driven recommendations for underwriting and portfolio management strategies.
Giuseppe D'Angelo added, "Data Science and Analytics is an established field within general insurance and in recent years has become a specialism in the London market within high-performing syndicates.
"Skill sets are highly sought after, so it's great that we have been able to put together two strong teams under Dan and Balint."
Day to day, the teams will be retrieving, manipulating, and visualising AEGIS London data, as well as helping put the power of analytics directly into the hands of business users, educating and collaborating with data champions across the business.
AEGIS London's CEO Alex Powell commented, "Maximising the potential of data is one of my key strategic priorities. So, with this team of experts, we will turn our rich pool of data into the raw material for decisions, insights and product development.
"We've been through a major transformation of our finance and operational systems, which has created the platform for advanced data-based decision-making."
Over time, the team will collaborate with other teams within the business to help them better understand what data they have access to and what we can achieve with it.
See the original post here:
AEGIS London reveals new Data Science and Analytics team to enhance underwriting - Reinsurance News
UAMS Awarded $1.3M for High School-Focused Tech and Data Science Program – Arkansas Business
The program will target underserved students in Northwest Arkansas. (Photo provided by UAMS)
UAMS has been awarded a $1.3 million grant from the National Institutes of Health (NIH) for its Arkansas Technology and Data Science in Health and Medicine (AR Tech-DaSH) program.
The five-year grant from the NIH's National Institute of Allergy and Infectious Diseases (NIAID) will support an outreach exposure program focused on technology and data science in health and medicine for high school students, teachers and the community, primarily in northwest Arkansas.
AR Tech-DaSH will incorporate imaging technologies and a data science curriculum focused on health and medicine into classroom outreach programs, a 10-day summer camp and community events.
The program will target underserved and underrepresented students in northwest Arkansas and will revolve around three major health concerns prevalent in the region: obesity and diabetes, cardiovascular disease and immunology, and cancer.
The grant will fund seminars for ninth-grade classes at schools in both rural and urban districts in Northwest Arkansas. Students will experience using a variety of medical-related technologies, such as stethoscopes, ultrasound, infrared and CT imaging, as well as data science-focused activities.
The 10-day summer camp, which will be held once a year, aims to provide 25 students with an integrated exposure to medical-related skills, clinician-patient simulations, research and case-based discussions of primary health concerns. The camp also hopes to provide students with an exposure to exploratory data analysis, data transformation, data mining and machine learning using health or medicine-related datasets.
Students who attend the AR Tech-DaSH camp will be designated as STEM ambassadors and will design and implement outreach events in their local communities with input from community stakeholders.
Also part of the program are virtual outreach sessions, which will be provided to rural classrooms across the state. Virtual teacher training workshops plan to show teachers how to incorporate imaging and data science into their classroom curriculum.
"The goal is to get students excited about STEM and data science careers so that the future workforce in these fields better reflects the diverse population in the U.S.," Kevin Phelan, AR Tech-DaSH program director, said in a press release. "Arkansas is a relatively poor, rural state with one of the lowest per capita income and education levels in the country. It faces the same challenges as other states in trying to prepare for the demands of a properly educated and diverse STEM workforce. Arkansas students desperately need early and repeated exposure to STEM and data science to be prepared not only for future careers but also to enable them to make data-driven decisions about lifestyle choices that affect their health."
Read the original post:
UAMS Awarded $1.3M for High School-Focused Tech and Data Science Program - Arkansas Business
Leveraging Python Pint Units Handler Package Part 2 | by Jose D. Hernandez-Betancur | Jul, 2024 – Towards Data Science
Create your customized unit registry for physical quantities in Python

Image generated by the author using OpenAI's DALL-E.
Real-world systems, like the supply chain, often involve working with physical quantities, like mass and energy. You don't have to be a professional scientist or engineer to make an app that can scale and let users enter quantities in any unit without the app crashing. Python has a robust and constantly growing ecosystem that is full of alternatives that can be easily integrated and expanded for your application.
In an earlier post, I talked about the Pint library, which makes working with physical quantities easy. For a fun way to learn and put together the different parts of our programming puzzle, feel free to go back to that post.
The goal of this article is to provide more information about the Pint package so that we can store unit definitions that are created on the fly and persist them after the program ends.
See the original post: