Category Archives: Data Science

Time Series Forecasting in the Age of GenAI: Make Gradient Boosting Behaves like LLMs – Towards Data Science

6 min read

The rise of Generative AI and Large Language Models (LLMs) has fascinated the entire world, sparking a revolution across many fields. While the primary focus of this technology has been on text sequences, attention is now turning to extending its capabilities to handle and process data formats beyond text.

Like most areas of AI, time series forecasting is not immune to the advent of LLMs, and that may be good news for everyone. Time series modeling is known to be something of an art, where results depend heavily on prior domain knowledge and careful tuning. LLMs, by contrast, are appreciated for being task-agnostic, holding enormous potential to apply their knowledge to varied tasks from different domains. From the union of these two areas, a new generation of time series forecasting models may emerge, capable of results that were previously unthinkable.

Read more from the original source:

Time Series Forecasting in the Age of GenAI: Make Gradient Boosting Behaves like LLMs - Towards Data Science

Forget Statistical Tests: A/B Testing Is All About Simulations – Towards Data Science

11 min read

Controlled experiments such as A/B tests are used heavily by companies.

However, many people are put off by A/B testing because of the intimidating statistical jargon it involves: confidence, power, p-value, t-test, effect size, and so on.

In this article, I will show you that you don't need a master's degree in statistics to understand A/B testing; quite the opposite. In fact, simulations can replace most of the statistical artifacts that were necessary 100 years ago.

Not only this: I will also show you that the feasibility of an experiment can be measured using something that, unlike confidence and power, is understandable by anyone in the company: dollars.

Your website has a checkout page. The ML team has come out with a new recommendation model. They claim that, by embedding their recommendations into the checkout page, we can increase revenues by an astounding 5%.
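To make the simulation idea concrete, here is a minimal sketch (not the author's code) of how such an experiment could be simulated in Python. The visitor counts, the exponential revenue model, and the claimed 5% uplift are illustrative assumptions; a full analysis would also price the decision in dollars, as the article describes.

import numpy as np

rng = np.random.default_rng(42)

def prob_treatment_wins(n_per_group=10_000, baseline_mean=30.0, uplift=0.05, n_sims=2_000):
    """Simulate the experiment many times, assuming the claimed uplift is real,
    and count how often the treatment's observed mean revenue beats control's."""
    wins = 0
    for _ in range(n_sims):
        control = rng.exponential(scale=baseline_mean, size=n_per_group)
        treatment = rng.exponential(scale=baseline_mean * (1 + uplift), size=n_per_group)
        wins += treatment.mean() > control.mean()
    return wins / n_sims

print(f"Chance the treatment looks better than control: {prob_treatment_wins():.1%}")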

See more here:

Forget Statistical Tests: A/B Testing Is All About Simulations - Towards Data Science

The Machine Learning Guide for Predictive Accuracy: Interpolation and Extrapolation – Towards Data Science

# Imports assumed by this excerpt (they are not shown in the original snippet)
import numpy as np
import xgboost as xgb
import lightgbm as lgbm
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from gplearn.genetic import SymbolicRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, VotingRegressor, StackingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.preprocessing import MinMaxScaler


class ModelFitterAndVisualizer:
    def __init__(self, X_train, y_train, y_truth, scaling=False, random_state=41):
        """
        Initialize the ModelFitterAndVisualizer class with training and testing data.

        Parameters:
            X_train (pd.DataFrame): Training data features
            y_train (pd.Series): Training data target
            y_truth (pd.Series): Ground truth for predictions
            scaling (bool): Flag to indicate if scaling should be applied
            random_state (int): Seed for random number generation
        """
        self.X_train = X_train
        self.y_train = y_train
        self.y_truth = y_truth

        self.initialize_models(random_state)

        self.scaling = scaling

    # Initialize models
    # -----------------------------------------------------------------
    def initialize_models(self, random_state):
        """
        Initialize the models to be used for fitting and prediction.

        Parameters:
            random_state (int): Seed for random number generation
        """
        # Define kernel for GPR
        kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)

        # Define Ensemble Models Estimator
        # Decision Tree + Kernel Method
        estimators_rf_svr = [
            ('rf', RandomForestRegressor(n_estimators=30, random_state=random_state)),
            ('svr', SVR(kernel='rbf')),
        ]
        estimators_rf_gpr = [
            ('rf', RandomForestRegressor(n_estimators=30, random_state=random_state)),
            ('gpr', GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=random_state))
        ]
        # Decision Trees
        estimators_rf_xgb = [
            ('rf', RandomForestRegressor(n_estimators=30, random_state=random_state)),
            ('xgb', xgb.XGBRegressor(random_state=random_state)),
        ]

        self.models = [
            SymbolicRegressor(random_state=random_state),
            SVR(kernel='rbf'),
            GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=random_state),
            DecisionTreeRegressor(random_state=random_state),
            RandomForestRegressor(random_state=random_state),
            xgb.XGBRegressor(random_state=random_state),
            lgbm.LGBMRegressor(n_estimators=50, num_leaves=10, min_child_samples=3, random_state=random_state),
            VotingRegressor(estimators=estimators_rf_svr),
            StackingRegressor(estimators=estimators_rf_svr,
                              final_estimator=RandomForestRegressor(random_state=random_state)),
            VotingRegressor(estimators=estimators_rf_gpr),
            StackingRegressor(estimators=estimators_rf_gpr,
                              final_estimator=RandomForestRegressor(random_state=random_state)),
            VotingRegressor(estimators=estimators_rf_xgb),
            StackingRegressor(estimators=estimators_rf_xgb,
                              final_estimator=RandomForestRegressor(random_state=random_state)),
        ]

        # Define graph titles
        self.titles = [
            "Ground Truth", "Training Points",
            "SymbolicRegressor", "SVR", "GPR",
            "DecisionTree", "RForest", "XGBoost", "LGBM",
            "Vote_rf_svr", "Stack_rf_svr__rf",
            "Vote_rf_gpr", "Stack_rf_gpr__rf",
            "Vote_rf_xgb", "Stack_rf_xgb__rf",
        ]

    def fit_models(self):
        """
        Fit the models to the training data.

        Returns:
            self: Instance of the class with fitted models
        """
        if self.scaling:
            scaler_X = MinMaxScaler()
            self.X_train_scaled = scaler_X.fit_transform(self.X_train)
        else:
            self.X_train_scaled = self.X_train.copy()

        for model in self.models:
            model.fit(self.X_train_scaled, self.y_train)
        return self

    def visualize_surface(self, x0, x1, width=400, height=500,
                          num_panel_columns=5,
                          vertical_spacing=0.06, horizontal_spacing=0,
                          output=None, display=False, return_fig=False):
        """
        Visualize the prediction surface for each model.

        Parameters:
            x0 (np.ndarray): Meshgrid for feature 1
            x1 (np.ndarray): Meshgrid for feature 2
            width (int): Width of the plot
            height (int): Height of the plot
            output (str): File path to save the plot
            display (bool): Flag to display the plot
        """
        num_plots = len(self.models) + 2
        num_panel_rows = num_plots // num_panel_columns

        whole_width = width * num_panel_columns
        whole_height = height * num_panel_rows

        specs = [[{'type': 'surface'} for _ in range(num_panel_columns)] for _ in range(num_panel_rows)]
        fig = make_subplots(rows=num_panel_rows, cols=num_panel_columns,
                            specs=specs, subplot_titles=self.titles,
                            vertical_spacing=vertical_spacing,
                            horizontal_spacing=horizontal_spacing)

        for i, model in enumerate([None, None] + self.models):
            # Assign the subplot panels
            row = i // num_panel_columns + 1
            col = i % num_panel_columns + 1

            # Plot training points
            if i == 1:
                fig.add_trace(go.Scatter3d(x=self.X_train[:, 0], y=self.X_train[:, 1], z=self.y_train,
                                           mode='markers',
                                           marker=dict(size=2, color='darkslategray'),
                                           name='Training Data'),
                              row=row, col=col)
                surface = go.Surface(z=self.y_truth, x=x0, y=x1, showscale=False, opacity=.4)
                fig.add_trace(surface, row=row, col=col)

            # Plot predicted surface for each model and ground truth
            else:
                y_pred = self.y_truth if model is None else model.predict(np.c_[x0.ravel(), x1.ravel()]).reshape(x0.shape)
                surface = go.Surface(z=y_pred, x=x0, y=x1, showscale=False)
                fig.add_trace(surface, row=row, col=col)

            fig.update_scenes(dict(
                xaxis_title='x0',
                yaxis_title='x1',
                zaxis_title='y',
            ), row=row, col=col)

        fig.update_layout(title='Model Predictions and Ground Truth',
                          width=whole_width, height=whole_height)

        # Change camera angle
        camera = dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=0),
            eye=dict(x=-1.25, y=-1.25, z=2)
        )
        for i in range(num_plots):
            fig.update_layout(**{f'scene{i+1}_camera': camera})

        if display:
            fig.show()

        if output:
            fig.write_html(output)

        if return_fig:
            return fig

More here:

The Machine Learning Guide for Predictive Accuracy: Interpolation and Extrapolation - Towards Data Science

60 Power BI Interview Questions and Expert Answers for 2024 – Simplilearn

Back in 2011, the rise of business intelligence tools challenged Microsoft to build its own. Microsoft introduced Power BI to bring compelling analytical capabilities to Microsoft Excel and make it intelligent enough to generate interactive reports.

According to Gartner's Magic Quadrant, Microsoft Power BI is one of today's top business intelligence tools, chiefly because most IT firms rely on Power BI for their business analytics. As a result, the current IT industry has a massive demand for Power BI experts.

This tutorial is dedicated to helping aspiring Power BI professionals grasp the essential fundamentals of Power BI and crack their interviews. The tutorial is organized into three categories, outlined below.

We have five dozen questions for you, so let's begin by going through some refresher-level or frequently asked beginner-level Power BI interview questions.

Power BI is a business analytics tool developed by Microsoft that helps you turn multiple unrelated data sources into valuable and interactive insights. These data may be in the form of an Excel spreadsheet or cloud-based/on-premises hybrid data warehouses. You can easily connect to all your data sources and share the insights with anyone.

Power BI is widely used because it provides an easy way for anyone, including non-technical people, to connect, transform, and visualize their raw business data from many different sources and turn it into valuable insights that make it easy to reach smart business decisions.

Both Tableau and Power BI are the current IT industry's data analytics and visualization giants. Yet, there are a few significant differences between them. You will now explore the important differences between Tableau and Power BI.

Tableau uses MDX for measures and dimensions, while Power BI uses DAX to calculate measures.

Tableau is capable of handling large volumes of data, while Power BI can handle only a limited amount of data.

Tableau is best suited to experts, while Power BI suits both experts and beginners.

The Tableau user interface is complicated, while the Power BI user interface is comparatively simpler.

Tableau supports the cloud with ease, while Power BI finds it difficult because its capacity to handle large volumes of data is limited.

The differences between Power Query and Power Pivot are as follows:

Power Query is all about getting and transforming data, while Power Pivot is all about analyzing and modeling it.

Power Query is an ETL service tool, while Power Pivot is an in-memory data modeling component.

Power BI Desktop is a free application designed and developed by Microsoft. Power BI Desktop allows users to connect to, transform, and visualize their data with ease. It lets users build visuals and collections of visuals that can be shared as reports with colleagues or clients in their organization.

Power Pivot is an add-on provided by Microsoft for Excel since 2010. Power Pivot was designed to extend the analytical capabilities and services of Microsoft Excel.

Power Query is a business intelligence tool designed by Microsoft for Excel. Power Query allows you to import data from various data sources and will enable you to clean, transform and reshape your data as per the requirements. Power Query allows you to write your query once and then run it with a simple refresh.

Self-service business intelligence (SSBI) is divided into the Excel BI Toolkit and Power BI.

SSBI is an abbreviation for Self-Service Business Intelligence and is a breakthrough in business intelligence. SSBI has enabled many business professionals with no technical or coding background to use Power BI and generate reports and draw predictions successfully. Even non-technical users can create these dashboards to help their business make more informed decisions.

DAX stands for Data Analysis Expressions. It's a collection of functions, operators, and constants used in formulas to calculate and return values. In other words, it helps you create new info from data you already have.

The term "filter" is self-explanatory. Filters are mathematical and logical conditions applied to data to isolate the essential information in rows and columns. The following are the types of filters available in Power BI:

Custom visuals are like any other visualizations generated using Power BI. The only difference is that custom visuals are developed using a custom SDK. Languages such as jQuery and JavaScript are used to create custom visuals in Power BI.

Get Data is a simple icon on Power BI used to import data from the source.

Some of the advantages of using Power BI:

Here are some limitations to using Power BI:

Power Pivot for Excel supports only single directional relationships (one to many), calculated columns, and one import mode. Power BI Desktop supports bi-directional cross-filtering connections, security, calculated tables, and multiple import options.

There are three main connectivity modes used in Power BI.

An SQL Server Import is the default and most common connectivity type used in Power BI. It allows you to use the full capabilities of the Power BI Desktop.

The Direct Query connection type is only available when you connect to specific data sources. In this connectivity type, Power BI will only store the metadata of the underlying data and not the actual data.

With this connectivity type, Power BI does not store data in the model. All interaction with a report using a Live Connection directly queries the existing Analysis Services model. Only three data sources support the live connection method: SQL Server Analysis Services (Tabular models and Multidimensional Cubes), Azure Analysis Services (Tabular Models), and Power BI datasets hosted in the Power BI Service.

Four important types of refresh options provided in Microsoft Power BI are as follows:

Several data sources can be connected to Power BI, and they are grouped into three main types:

It can import data from Excel (.xlsx, .xlsm), Power BI Desktop files (.pbix) and Comma-Separated Values (.csv).

These are a collection of related documents or files stored as a group. There are two types of content packs in Power BI:

Connectors help you connect your databases and datasets with apps, services, and data in the cloud.

A dashboard is a single-layer presentation sheet of multiple visualizations reports. The main features of the Power BI dashboard are:

Relationships between tables are defined in two ways:

No. There can be multiple inactive relationships, but only one active relationship between two tables in a Power Pivot data model. Dotted lines represent inactive relationships, and continuous lines represent active relationships.

Yes. There are two main reasons why you can have disconnected tables:

The CALCULATE function evaluates the sum of the Sales table Sales Amount column in a modified filter context. It is also the only function that allows users to modify the filter context of measures or tables.

Moving ahead, you will step up to the following Power BI Interview Questions from the Intermediate Level.

Most of the time, Power BI stores data in the cloud, though it can also work with the desktop service. Microsoft Azure is used as the primary cloud service to store the data.

Row-level security limits the data a user can view and has access to, and it relies on filters. Users can define the rules and roles in Power BI Desktop and also publish them to Power BI Service to configure row-level security.

Users can use general formatting to make it easier for Power BI to categorize and identify data, making it considerably easier to work with.

There are three different views in Power BI, each of which serves another purpose:

Report View - In this view, users can add visualizations and additional report pages and publish the same on the portal.

Data View - In this view, data shaping can be performed using Query Editor tools.

Model View - In this view, users can manage relationships between complex datasets.

The important building blocks of Power BI are as follows:

Visualization is the process of generating charts and graphs for the representation of insights on business data.

A dataset is the collection of data used to create a visualization, such as a column of sales figures. Datasets can be combined and filtered from a variety of sources via built-in data plugins.

The final stage is the report stage. Here, there is a group of visualizations on one or more pages. For example, charts and maps are combined to make a final report.

A Power BI dashboard lets you share a single-page view of your visualizations with colleagues and clients.

A tile is an individual visualization on your final dashboard or one of your charts in your final report.

The critical components of Power BI are mentioned below.

A content pack is defined as a ready-made collection of visualizations and Power BI reports using your chosen service. You'd use a content pack when you want to get up and running quickly instead of creating a report from scratch.

Bidirectional cross-filtering lets data modelers decide how they want their Power BI Desktop filters to flow for data, using the relationships between tables. The filter context is transmitted to a second, related table on the other side of a given table relationship. This helps data modelers solve the many-to-many problem without having to write complicated DAX formulas. In short, bidirectional cross-filtering makes data modelers' jobs easier.

Syntax is how the formula is written, that is, the elements that comprise it. The syntax includes functions such as SUM (used when you want to add figures). If the syntax isn't correct, you'll get an error message.

These are formulas that use specific values (also known as arguments) in a particular order to perform a calculation, similar to the functions in Excel. The categories of functions are date/time, time intelligence, information, logical, mathematical, statistical, text, parent/child, and others.

There are two types: row context and filter context. Row context comes into play whenever a formula has a function that applies filters to identify a single row in a table. When one or more filters are applied in a calculation that determines a result or value, the filter context comes into play.

You will use a custom visual file if the prepackaged files don't fit the needs of your business. Developers create custom visual files, and you can import them and use them in the same way as you would the prepackaged files.

A few familiar data sources are Excel, Power BI datasets, web, text, SQL server, and analysis services.

Power BI Desktop helps you to group the data in your visuals into chunks. You can, however, define your groups and bins. For grouping, use Ctrl + click to select multiple elements in the visual. Right-click one of those elements and, from the menu that appears, choose Group. In the Groups window, you can create new groups or modify existing ones.

On a Power BI final report page, a developer can resize a responsive slicer to various sizes and shapes, and the data collected in the container will be rearranged to find a match. If a visual report becomes too small to be useful, an icon representing the visual takes its place, saving space on the report page.

Query folding is used when steps defined in the Query Editor are translated into SQL and executed by the source database instead of your device. It helps with scalability and efficient processing.

M is the programming language used in Power Query. It is a functional, case-sensitive language that is similar to other programming languages and easy to use.

Visual-level filters are used to filter data within a single visualization. Page-level filters are used to work on an entire page in a report, and different pages can have various filters.

Report-level filters are used to filter all the visualizations and pages in the report.

Users can set up automatic data refreshes based on daily or weekly requirements. Users can schedule at most one refresh per day unless they have Power BI Pro. In the Scheduled Refresh section, pull-down menus let you select a frequency, time zone, and time of day.

Power Map can display geographical visualizations. Therefore, some location data is needed, for example, city, state, country, or latitude and longitude.

Power Pivot uses the xVelocity engine. xVelocity can handle huge amounts of data, storing data in columnar databases. All data gets loaded into RAM memory when you use in-memory analytics, which boosts the processing speed.

Following are some of the important Components of SSAS:

An OLAP engine is used by end users to run ad hoc queries extensively and at a faster pace.

Data drilling in SSAS is the process of exploring the details of the data at multiple levels of granularity.

Data slicing in SSAS is the process of storing the data in rows and columns.

Pivot tables help switch between the different categories of data stored in rows and columns.

Power BI is available mainly in three formats, as mentioned below.

There are three different stages in working on Power BI, as explained below.

The primary step in any business intelligence is to establish a successful connection with the data source and integrate it to extract data for processing.

The next step in business intelligence is data processing. Raw data often includes unexpected erroneous values, and some data cells might be empty. In the data processing stage, the BI tool needs to handle these missing values and inaccurate data.

The final stage in business intelligence is analyzing the data obtained from the source and presenting the insights using visually appealing graphs and interactive dashboards.

Power BI is preferred by both beginners and experts in business intelligence. It is used mainly by the following professionals.

A business analyst is a professional who analyzes business data and presents the insights found using visually appealing graphs and dashboards.

Business owners, decision-makers, or organizations use Power BI to view the insights and understand the prediction to make a business decision.

Business developers are software developers hired for business purposes to build custom applications and dashboards that help business processes run smoothly.

The Advanced Editor is used to view the queries that Power BI runs against the data sources when importing data. The query is rendered in M code. To view the query code, users select Edit Queries from the Home tab, then click Advanced Editor to work on the query. Any changes are saved to Applied Steps in the Query Settings.

Gateways function as bridges between the in-house data sources and Azure Cloud Services.

There are multiple applications of Power BI; some of them are as follows:

Every individual chart or visualization report generated is collected and represented on a single screen. Such an approach is called a Power BI dashboard. A dashboard in Power BI is used to tell a story.

KPI stands for Key Performance Indicator. Professional organizations have their teams and employees follow KPI protocols: the organization sets up KPIs for all employees, and these KPIs act as their targets. The KPIs are compared with previous performance to analyze progress.

Slicers are an integral part of a business report generated using Power BI. The functionality of a slicer is similar to that of a filter but, unlike a filter, a slicer displays a visual representation of all values and lets users select from the available values in the slicer's drop-down menu.

It is a combined solution offered to upload the reports and dashboards to the PowerBI.com website for reference. It consists of Power Pivot, Power Query, and Power Table.

Read this article:

60 Power BI Interview Questions and Expert Answers for 2024 - Simplilearn

Eco-Friendly AI: How to Reduce the Carbon and Water Footprints of Your ML Models – Towards Data Science

11 min read

As we push the boundaries of AI, especially with generative models, we are confronted with a pressing question that is forecast to become only more urgent: what is the environmental cost of our progress? Training, hosting, and running these models aren't just compute-intensive; they require substantial natural resources, leading to significant carbon and water footprints that often fly under the radar. This discussion has become even more timely with Google's recent report on July 2, 2024, highlighting the challenges in meeting their ambitious climate goals. The report revealed a 13% increase in emissions in 2023 compared to the previous year and a 48% rise compared to their baseline year of 2019. The demand for AI has significantly strained data centers, a trend reflected in Microsoft's environmental sustainability report from May, which noted a 29% increase in emissions above their 2020 baseline due to data center usage. Additionally, the International Energy Agency predicts that global data center and AI electricity demand could double by 2026, underscoring the urgent need for sustainable practices. For everyone...
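One practical way to make these footprints visible is to measure them during training. The sketch below is not from the article; it assumes the open-source codecarbon package and uses a throwaway scikit-learn model purely as a stand-in for a real training run.

from codecarbon import EmissionsTracker
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder dataset and model, standing in for a real training workload
X, y = make_classification(n_samples=50_000, n_features=40, random_state=0)

tracker = EmissionsTracker(project_name="demo-training-run")
tracker.start()
RandomForestClassifier(n_estimators=300, n_jobs=-1).fit(X, y)
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent for the run

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")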

See the original post:

Eco-Friendly AI: How to Reduce the Carbon and Water Footprints of Your ML Models - Towards Data Science

Certifications That Can Boost Your Data Science Career in 2024 – KDnuggets

In today's data science landscape, how do you set yourself apart from the competition? Data science certifications can greatly enhance your career by proving your skills and creating new job opportunities in 2024. They help you gain knowledge of data science and validate your skills in the field. Let's take a look at some of the best certifications out there.

Candidates passing the Certified Analytics Professional certification demonstrate proficiency in analytics. A candidate needs a bachelor's degree and five years of experience in analytics to take this exam; a master's degree and three years of experience, or seven years of non-analytics experience, are also sufficient. The exam includes topics such as data preparation and model building.

Cost: $375 for members, $575 for non-members. Validity: 3 years

The Google Certified Professional Data Engineer certification proves one's skills in developing and designing Google Cloud Platform data processing systems. The candidates should have knowledge of data engineering and the practical usage of Google Cloud products. No formal qualifications are explicitly stated but applicants are expected to have at least three years of working experience in the relevant field. One of those years should be specifically dedicated to developing and implementing concepts related to Google Cloud.

Cost: $200 USD. Validity: 2 years

The Azure Data Scientist Associate certification verifies abilities in implementing and operating machine learning solutions on Azure technologies. The exam is designed to check proficiency in ML, NLP and computer vision. It also shows how Azure's vast toolbox and services may lead to the development of scalable and successful machine learning systems. One should have enough knowledge and skills in data science concepts and proficiency in Python.

Cost: $165. Validity: 2 years

The SAS Certified Data Scientist certification shows that the candidate is able to use SAS for data analysis. In order to progress to the AI & Machine Learning Professional or Advanced Analytics certification, one has to clear the Advanced Programming and Data Curation Professional exams. This test measures the understanding and expertise of the candidate on various aspects related to the data analysis process. These skills include aggregation, transformation, and cleaning of data.

Cost: $180. Validity: 3 years

The topics covered in the IBM Data Science Professional Certificate online course equip learners with skills that enable them to practice data science. Program participants can enroll in several online classes that can serve as one's foundation for pursuing a career in data science. Beginner courses include overviews of data exploration and visualization. The program also includes modules focused on advanced areas within data science, such as data analysis with Python and data visualization with R.

Cost: $234. Validity: Credentials do not expire

The Cloudera Certified Professional Data Engineer certification consists of several aspects of data engineering within a big data environment. It equips candidates with the expertise needed to design robust and scalable data pipelines for large scale data processing. Candidates seeking this certification should possess an understanding of data engineering principles and be well versed in big data technologies.

Cost: $400. Validity: 2 years

The Senior Data Scientist certificate recognizes individuals who show proficiency in machine learning and data science modeling. The Senior Data Scientist certification comes with criteria that candidates need to fulfill to obtain it. A master's degree in a related field, such as computer science, statistics, or mathematics, is one of these. Candidates must also have at least five years' worth of work experience in data science.

Cost: $775. Validity: 3 years

Gaining a certification can significantly enhance your data science career in 2024: it verifies your credibility, enhances your reputation, and opens new opportunities for promotion. These certifications help you prove that you are a competent data scientist, capable of and willing to solve problems.

Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a Master's degree in Computer Science from the University of Liverpool.

Go here to see the original:

Certifications That Can Boost Your Data Science Career in 2024 - KDnuggets

Northwestern Mutual Data Science Institute: The promise of corporate-academic partnerships – University of Wisconsin-Milwaukee

Ethan Gu, a PhD student in computer science at Marquette University, talks with Onochie Fan-Osuala, an associate professor at UW-Whitewater, at a Northwestern Mutual Data Science Institute event. (Jaclyn Tyler photo)

In 2018, three influential local organizations (Northwestern Mutual, Marquette University and the University of Wisconsin-Milwaukee) united to form the Northwestern Mutual Data Science Institute (NMDSI) with a shared focus on the ever-evolving and dynamic domains of data science and artificial intelligence.

Today, the NMDSI remains committed to advancing research innovation, creating education pathways and driving community impact; our mission holds greater relevance than ever before.

Northwestern Mutual, Marquette and UWM all recognize and contribute to Southeastern Wisconsin's trajectory as a national hub for technology. With nearly $75 million in investments committed to date, NMDSI continues to push the boundaries of what is possible.

The institute also provides funding for an endowed professorship at each university, research projects/grants, student scholarships, new faculty recruitment, development of expanded curriculum and new degree programs, and K-12 STEM and pre-college programming, including internships and certificates. Fostering increased collaboration and generating innovative project ideation across the institute and our community partners, our NMDSI-affiliated faculty have garnered more than $17 million in grants to date.

Last fall, as part of our five-year renewed investment, we developed a new Center of Excellence initiative through which we announced three innovative programs and engagement opportunities for our university partners. As a result, we awarded $500,000 in research dollars as part of the Paving ROADS Seed Fund Program, $100,000 on behalf of the Pioneer Curricula Program and $175,000 for the NMDSI Student Research Scholars Program. Moreover, the NMDSI also organizes numerous events that have attracted thousands of attendees since our inception. This is only the beginning of what is possible when industry collaborates with academia.

I am extremely proud of how far we have come over the past six years and am even more excited about our future. We will soon be rolling out our NMDSI Industry Affiliate Program, an opportunity to add additional partners who share in the mission of the NMDSI. In addition to our ongoing monthly speaker series, we have various upcoming engagement opportunities through the end of the year, including a Faculty AI Summit in partnership with the Higher Education Regional Alliance and a national conference focused on the ethics of AI. I encourage you to visit the NMDSI website to stay up to date on all our news/events, sign up for our newsletter, and follow us on LinkedIn.

Continue reading here:

Northwestern Mutual Data Science Institute: The promise of corporate-academic partnerships - University of Wisconsin-Milwaukee

Warren Kibbe, Ph.D. – Deputy Director for Data Science and Strategy – National Cancer Institute (.gov)

Warren Kibbe, Ph.D., Deputy Director for Data Science and Strategy

As NCI deputy director for data science and strategy, Warren A. Kibbe, Ph.D., FACMI, serves as the senior advisor to the NCI director for all matters related to data science. He provides strategic direction to the NCI Center for Bioinformatics and Information Technology (CBIIT) and manages and oversees all aspects of data science for the institute.

In this role, Dr. Kibbe provides strategic counsel on the development and implementation of key data science initiatives, including the NCI Childhood Cancer Data Initiative, the Cancer Research Data Commons, and the ARPA-H Biomedical Data Fabric Toolbox. He also serves as senior data science liaison to a variety of NIH and other government committees.

Under Dr. Kibbe's leadership, NCI is applying new approaches to enhance NCI's data ecosystem, growing a diverse and talented data science workforce, and building strategic partnerships to develop and disseminate advanced technologies and methods.

Until June 2024, Dr. Kibbe served as the chief for Translational Biomedical Informatics and vice chair of the Department of Biostatistics and Bioinformatics in the Duke University School of Medicine and as chief data officer for the Duke Cancer Institute. He also served as director of informatics for the Duke Clinical and Translational Science Institute.

Before joining Duke, Dr. Kibbe served as an acting deputy director of NCI from 2016 to 2017 and director of NCI CBIIT from 2013 to 2017. While at NCI, he enhanced the institute's digital capabilities, including biomedical informatics, scientific management information systems, and computing resources. He also helped establish the Genomic Data Commons and the NCI Cloud Pilots (now Cloud Resources), was instrumental in establishing NCI's partnership with the U.S. Department of Energy to advance precision oncology and scientific computing, and played a pivotal role in precision medicine and Cancer Moonshot activities, engaging both federal and private sector partners in cancer research.

Dr. Kibbes research interests include data representation for clinical trials and improving data interoperability between electronic health records and decision support algorithms. He has been a proponent for open science and open data in biomedical research and helped define the data sharing policy for the Cancer Moonshot. In 2018, Dr. Kibbe was elected a fellow of the American College of Medical Informatics.

Dr. Kibbe received his Ph.D. in Chemistry from the California Institute of Technology and completed his postdoctoral fellowship in molecular genetics at the Max Planck Institute in Göttingen, Germany. He received his B.S. in Chemistry from Michigan Technological University.

Read more:

Warren Kibbe, Ph.D. - Deputy Director for Data Science and Strategy - National Cancer Institute (.gov)

Stratos Idreos Appointed Faculty Co-Director of Harvard Data Science Initiative – Harvard School of Engineering and Applied Sciences

Stratos Idreos, the Gordon McKay Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS), has been named Faculty Co-Director of the Harvard Data Science Initiative (HDSI). Idreos will join Francesca Dominici, Faculty Director of the HDSI, in leading the Initiative.

Since its launch in 2017, the HDSI has united data science research and education efforts across the university. It has brought together leading computer scientists, statisticians, and experts from a range of disciplines including law, business, education, medicine and public health to advance data science in personalized health, public policy, scientific discovery and more.

David C. Parkes, the John A. Paulson Dean of SEAS, served as Faculty Co-Director of the Initiative from 2017 to 2023.

At SEAS, Idreos directs the Data Systems Laboratory. His lab is developing fast self-designing data engines that accelerate research and improve productivity across several data-intensive areas including data analytics, data science and artificial intelligence. Self-designing data engines shape themselves automatically to the data, hardware and application context to achieve the best possible performance.

Idreos joined SEAS in 2014. He completed his undergraduate studies and master's degree at the Technical University of Crete in Greece, and his Ph.D. at the University of Amsterdam. He has received the United States Department of Energy Early Career Award, an NSF CAREER grant, and multiple recognitions from the Association for Computing Machinery's Special Interest Group on Management of Data. In 2023, he was awarded the Capers W. McDonald and Marion K. McDonald Award for Excellence in Mentoring and Advising.

Continue reading here:

Stratos Idreos Appointed Faculty Co-Director of Harvard Data Science Initiative - Harvard School of Engineering and Applied Sciences

Introduction to Linear Programming Part II | by Robert Lohne | Jul, 2024 – Towards Data Science

Last year, I was approached by a friend who works in a small, family-owned steel and metal business. He wanted to know if it was possible to create something that would help him solve the problem of minimising waste when cutting steel beams. Sounds like a problem for linear programming!

When I started out, there were not many beginner articles on how to use linear programming in R that made sense for somebody not that versed in math. Linear programming with R was also an area where ChatGPT did not shine in early 2023, so I found myself wishing for a guide.

This series is my attempt at making such a guide. Hopefully it will be of use to someone.

This is part II of the series; if you are looking for an introduction to linear programming in R, have a look at part I.

If you read the theory behind linear programming, or linear optimisation, you end up with a lot of math. This can be off-putting if you don't have a math background (I don't). For me, it's mostly because I never took enough math classes to understand many of the symbols. Initially, this made understanding the tutorials surrounding linear programming harder than it should have been. However, you don't need to understand the math behind the theory to apply the principles in the code in this article.
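The article itself works in R, but to illustrate the principle without the math, here is a minimal sketch in Python (the language used elsewhere on this page) using scipy.optimize.linprog. The stock length, piece demands, and cutting patterns are made-up numbers, and a real cutting plan would need integer constraints (for example via a MILP solver, or lpSolve in R).

import numpy as np
from scipy.optimize import linprog

# Hypothetical example: 12 m stock beams, demand for ten 4 m pieces and six 5 m pieces.
# Each variable counts how many beams are cut with a given pattern; the objective is waste.
#   pattern A: 3 x 4 m          -> 0 m waste
#   pattern B: 2 x 5 m          -> 2 m waste
#   pattern C: 1 x 4 m + 1 x 5 m -> 3 m waste
waste = [0, 2, 3]

# Demand constraints (>=), rewritten as <= for linprog by negating both sides.
A_ub = [[-3, 0, -1],   # 4 m pieces produced must be >= 10
        [0, -2, -1]]   # 5 m pieces produced must be >= 6
b_ub = [-10, -6]

res = linprog(c=waste, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")
print("Beams per pattern (LP relaxation):", np.round(res.x, 2))
print("Total waste (m):", round(res.fun, 2))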

See the rest here:

Introduction to Linear Programming Part II | by Robert Lohne | Jul, 2024 - Towards Data Science