Category Archives: Data Science

Advancing drug discovery with AI: introducing the KEDD framework – EurekAlert

Image: A simple but effective feature fusion framework that jointly incorporates biomolecular structures, knowledge graphs, and biomedical texts for AI drug discovery.

Credit: Yizhen Luo, Institute for AI Industry Research (AIR), Tsinghua University

A transformative study published in Health Data Science, a Science Partner Journal, introduces a groundbreaking end-to-end deep learning framework, known as Knowledge-Empowered Drug Discovery (KEDD), aimed at revolutionizing the field of drug discovery. This innovative framework adeptly integrates structured and unstructured knowledge, enhancing the AI-driven exploration of molecular dynamics and interactions.

Traditionally, AI applications in drug discovery have been constrained by their focus on single tasks, neglecting the structured and unstructured data that could improve their predictive accuracy. These limitations are particularly pronounced for novel compounds or proteins, where existing knowledge is scant or absent and manual data annotation is prohibitively expensive.

Professor Zaiqing Nie, from Tsinghua University's Institute for AI Industry Research, emphasizes the enhancement potential of AI in drug discovery through KEDD. This framework synergizes data from molecular structures, knowledge graphs, and biomedical literature, offering a comprehensive approach that transcends the limitations of conventional models.

At its core, KEDD employs robust representation learning models to distill dense features from various data modalities. Following this, it integrates these features through a fusion process and leverages a predictive network to ascertain outcomes, facilitating its application across a spectrum of AI-facilitated drug discovery endeavors.
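The encode, fuse, and predict flow described above can be sketched roughly as follows. This is a toy illustration only, not KEDD's implementation: the three encoder functions, the concatenation-based fusion, the linear prediction head, and the example identifiers and weights are all hypothetical stand-ins for the learned models the study uses.

```python
import math

# Hypothetical stand-ins for pretrained encoders. Each returns a small
# fixed-length feature vector; real encoders would be learned models.
def encode_structure(smiles):
    # Toy features from character counts, NOT a real molecular encoder.
    return [smiles.count("C") / 10, smiles.count("O") / 10, len(smiles) / 100]

def encode_knowledge_graph(entity_id, kg_embeddings):
    # Look up a precomputed entity embedding; zeros if the entity is unknown.
    return kg_embeddings.get(entity_id, [0.0, 0.0])

def encode_text(description):
    words = description.lower().split()
    return [len(words) / 50, float("inhibitor" in words)]

def fuse(*features):
    # Fusion by simple concatenation (an assumption made for this sketch).
    fused = []
    for feature in features:
        fused.extend(feature)
    return fused

def predict(fused, weights, bias):
    # Linear head followed by a sigmoid, giving a score in (0, 1).
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1 / (1 + math.exp(-z))

# Illustrative inputs; the embedding values and weights are made up.
kg = {"DB00945": [0.3, -0.1]}
fused = fuse(
    encode_structure("CC(=O)OC1=CC=CC=C1C(=O)O"),
    encode_knowledge_graph("DB00945", kg),
    encode_text("aspirin is a cyclooxygenase inhibitor"),
)
score = predict(fused, weights=[0.5] * len(fused), bias=0.0)
print(len(fused), round(score, 3))
```

In a real system each encoder would be a trained representation model and the prediction head would be trained per task; the sketch only shows how features from the three modalities meet in one vector.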

The study substantiates KEDD's effectiveness, showcasing its ability to outperform existing AI models in critical drug discovery tasks. Notably, KEDD demonstrates resilience in the face of the 'missing modality problem,' where lack of documented data on new drugs or proteins could undermine analytical processes. This resilience stems from its innovative use of sparse attention and modality masking techniques, which harness the power of existing knowledge bases to inform predictions and analyses.

Looking forward, Yizhen Luo, a key contributor to the KEDD project, outlines ambitious plans to enhance the framework's capabilities, including the exploration of multimodal pre-training strategies. The overarching objective is to cultivate a versatile, knowledge-driven AI ecosystem that accelerates biomedical research, delivering timely insights and recommendations to advance therapeutic discovery and development.

Journal: Health Data Science

Article Title: Toward Unified AI Drug Discovery with Multimodal Knowledge

A Collection Of Free Data Science Courses From Harvard, Stanford, MIT, Cornell, and Berkeley – KDnuggets

Free courses are very popular on our platform, and we've received many requests from both beginners and professionals for more resources. To meet the demand of aspiring data scientists, we are providing a collection of free data science courses from the top universities in the world.

University professors and technical assistants teach these courses and cover topics such as math, probability, programming, databases, data analytics, data processing, data analysis, and machine learning. By the end of these courses, you'll have gained the skills required to master data science and become job-ready.

Link: 5 Free University Courses to Learn Computer Science

If you're considering switching to a career in data, it's crucial to learn computer science fundamentals. Many data science job applications include a coding interview section where you'll need to solve problems using a programming language of your choice.

This compilation offers some of the best free university courses to help you master foundations like computer hardware/software. You will learn Python, data structures and algorithms, as well as essential tools for software engineering.

Link: 5 Free University Courses to Learn Python

A curated list of five online courses offered by renowned universities like Harvard, MIT, Stanford, University of Michigan, and Carnegie Mellon University. These courses are designed to teach Python programming to beginners, covering fundamentals such as variables, control structures, data structures, file I/O, regular expressions, object-oriented programming, and computer science concepts like recursion, sorting algorithms, and computational limits.

Link: 5 Free University Courses to Learn Databases and SQL

It is a list of free database and SQL courses offered by renowned universities such as Cornell, Harvard, Stanford, and Carnegie Mellon University. These courses cover a wide range of topics, from the basics of SQL and relational databases to advanced concepts like NoSQL, NewSQL, database internals, data models, database design, distributed data processing, transaction processing, query optimization, and the inner workings of modern analytical data warehouses like Google BigQuery and Snowflake.

Link: 5 Free University Courses on Data Analytics

A compilation of online courses and resources for individuals interested in pursuing data science, machine learning, and artificial intelligence. It highlights courses from prestigious institutions like Harvard, MIT, Stanford, and Berkeley, covering topics such as Python for data science, statistical thinking, data analytics, mining massive data sets, and an introduction to artificial intelligence.

Link: 5 Free University Courses to Learn Data Science

A comprehensive list of free online courses from Harvard, MIT, and Stanford, designed to help individuals learn data science from the ground up. It begins with an introduction to Python programming and data science fundamentals, followed by courses covering computational thinking, statistical learning, and the mathematics behind data science concepts. The courses cover a wide range of topics, including programming, statistics, machine learning algorithms, dimensionality reduction techniques, clustering, and model evaluation.

Link: 9 Free Harvard Courses to Learn Data Science - KDnuggets

It outlines a data science learning roadmap consisting of 9 free courses offered by Harvard. It starts with learning programming basics in either R or Python, followed by courses on data visualization, probability, statistics, and productivity tools. It then covers data pre-processing techniques, linear regression, and machine learning concepts. The final step involves a capstone project that allows learners to apply the knowledge gained from the previous courses to a hands-on data science project.

Free online courses from top universities are an incredible resource for anyone looking to break into the field of data science or upgrade their current skills. This curated collection contains a list of courses that covers all the key areas - from core computer science and programming with Python, to databases and SQL, data analytics, machine learning, and full data science curricula. With courses taught by world-class professors, you can gain comprehensive knowledge and hands-on experience with the latest data science tools and techniques used in industry.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

Create Mixtures of Experts with MergeKit | by Maxime Labonne | Mar, 2024 – Towards Data Science

MoEs also come with their own set of challenges, especially in terms of fine-tuning and memory requirements. The fine-tuning process can be difficult due to the model's complexity, with the need to balance expert usage during training so that the gating weights learn to select the most relevant experts. In terms of memory, even though only a fraction of the total parameters are used during inference, the entire model, including all experts, needs to be loaded into memory, which requires high VRAM capacity.

More specifically, there are two essential parameters when it comes to MoEs:

Historically, MoEs have underperformed dense models. However, the release of Mixtral-8x7B in December 2023 shook things up and showed impressive performance for its size. Additionally, GPT-4 is also rumored to be an MoE, which would make sense as it would be a lot cheaper to run and train for OpenAI compared to a dense model. In addition to these recent excellent MoEs, we now have a new way of creating MoEs with MergeKit: frankenMoEs, also called MoErges.

The main difference between true MoEs and frankenMoEs is how they're trained. In the case of true MoEs, the experts and the router are trained jointly. In the case of frankenMoEs, we upcycle existing models and initialize the router afterward.

In other words, we copy the weights of the layer norm and self-attention layers from a base model, and then copy the weights of the FFN layers found in each expert. This means that besides the FFNs, all the other parameters are shared. This explains why Mixtral-8x7B with eight experts doesn't have 8*7 = 56B parameters, but about 45B. This is also why using two experts per token gives the inference speed (FLOPs) of a 12B dense model instead of 14B.
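To make the "two experts per token" idea concrete, here is a minimal plain-Python sketch of top-k routing. It is not MergeKit or Mixtral code: the function names, the toy experts, and the gate logits are hypothetical, and real routers compute the logits from the token's hidden state with a learned linear layer.

```python
import math

def softmax(values):
    # Numerically stable softmax over a short list of floats.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    out = [0.0] * len(x)
    for index, weight in route_token(gate_logits, k):
        y = experts[index](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Four toy "experts" standing in for FFN blocks (illustrative only).
experts = [
    lambda v: [2 * a for a in v],
    lambda v: [a + 1 for a in v],
    lambda v: [-a for a in v],
    lambda v: [0.0 for _ in v],
]
gate_logits = [0.1, 2.0, 2.0, -1.0]  # made-up router scores for one token
routing = route_token(gate_logits, k=2)
output = moe_forward([1.0, 2.0], gate_logits, experts, k=2)
print(routing, output)
```

Only the selected experts run for a given token, which is why compute per token tracks the number of active experts while memory still has to hold all of them.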

FrankenMoEs are about selecting the most relevant experts and initializing them properly. MergeKit currently implements three ways of initializing the routers:

As you can guess, the hidden initialization is the most efficient to correctly route the tokens to the most relevant experts. In the next section, we will create our own frankenMoE using this technique.

To create our frankenMoE, we need to select n experts. In this case, we will rely on Mistral-7B thanks to its popularity and relatively small size. However, eight experts like in Mixtral is quite a lot, as we need to fit all of them in memory. For efficiency, I'll only use four experts in this example, with two of them engaged for each token and each layer. In this case, we will end up with a model with 24.2B parameters instead of 4*7 = 28B parameters.
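The 24.2B figure can be sanity-checked with a rough parameter count. The sketch below assumes the commonly published Mistral-7B dimensions (hidden size 4096, FFN size 14336, 32 layers, 32 attention heads with 8 key/value heads, vocabulary of 32000) and assumes a bias-free router with one linear layer per transformer layer; treat all of these as assumptions rather than values taken from this article.

```python
# Assumed Mistral-7B dimensions (not stated in the article).
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
VOCAB = 32000
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = HIDDEN // N_HEADS

def ffn_params_per_layer():
    # Gate, up, and down projections of one feed-forward block.
    return 3 * HIDDEN * INTERMEDIATE

def attention_params_per_layer():
    q_and_o = 2 * HIDDEN * HIDDEN
    k_and_v = 2 * HIDDEN * (N_KV_HEADS * HEAD_DIM)  # grouped-query attention
    return q_and_o + k_and_v

def shared_params():
    # Everything that is NOT duplicated per expert.
    norms = LAYERS * 2 * HIDDEN + HIDDEN
    embeddings = 2 * VOCAB * HIDDEN  # input embeddings + output head
    return LAYERS * attention_params_per_layer() + norms + embeddings

def router_params(n_experts):
    # Assumed: one bias-free linear router per layer.
    return LAYERS * HIDDEN * n_experts

def moe_params(n_experts):
    experts = n_experts * LAYERS * ffn_params_per_layer()
    return shared_params() + experts + router_params(n_experts)

def active_params(n_experts, experts_per_token):
    used = experts_per_token * LAYERS * ffn_params_per_layer()
    return shared_params() + used + router_params(n_experts)

total = moe_params(4)
active = active_params(4, 2)
print(f"{total / 1e9:.1f}B total, {active / 1e9:.1f}B active per token")
```

Under these assumptions the total comes out at about 24.2B, matching the figure above, with roughly 12.9B parameters active per token when two experts are engaged.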

Here, our goal is to create a well-rounded model that can do pretty much everything: write stories, explain articles, code in Python, etc. We can decompose this requirement into four tasks and select the best expert for each of them. This is how I decomposed it:

Now that we've identified the experts we want to use, we can create the YAML configuration that MergeKit will use to create our frankenMoE. This uses the mixtral branch of MergeKit. You can find more information about how to write the configuration on this page. Here is our version:

For each expert, I provide five basic positive prompts. You can be a bit fancier and write entire sentences if you want. The best strategy consists of using real prompts that should trigger a particular expert. You can also add negative prompts to do the opposite.

Once this is ready, you can save your configuration as config.yaml. In the same folder, we will download and install the mergekit library (mixtral branch).

If your computer has enough RAM (roughly 24-32 GB of RAM), you can run the following command:

If you don't have enough RAM, you can shard the models instead as follows (it will take longer):

This command automatically downloads the experts and creates the frankenMoE in the merge directory. For the hidden gate mode, you can also use the --load-in-4bit and --load-in-8bit options to compute hidden states with lower precision.

Alternatively, you can copy your configuration into LazyMergekit, a wrapper I made to simplify model merging. In this Colab notebook, you can input your model name, select the mixtral branch, specify your Hugging Face username/token, and run the cells. After creating your frankenMoE, it will also upload it to the Hugging Face Hub with a nicely formatted model card.

I called my model Beyonder-4x7B-v3 and created GGUF versions of it using AutoGGUF. If you can't run GGUF versions on your local machine, you can also perform inference using this Colab notebook.

To get a good overview of its capabilities, it has been evaluated on three different benchmarks: Nous benchmark suite, EQ-Bench, and the Open LLM Leaderboard. This model is not designed to excel in traditional benchmarks, as the code and role-playing models generally do not apply to those contexts. Nonetheless, it performs remarkably well thanks to strong general-purpose experts.

Nous: Beyonder-4x7B-v3 is one of the best models on Nous benchmark suite (evaluation performed using LLM AutoEval) and significantly outperforms the v2. See the entire leaderboard here.

EQ-Bench: It's also the best 4x7B model on the EQ-Bench leaderboard, outperforming older versions of ChatGPT and Llama-2-70b-chat. Beyonder is very close to Mixtral-8x7B-Instruct-v0.1 and Gemini Pro, which are (supposedly) much bigger models.

Open LLM Leaderboard: Finally, it's also a strong performer on the Open LLM Leaderboard, significantly outperforming the v2 model.

On top of these quantitative evaluations, I recommend checking the model's outputs in a more qualitative way using a GGUF version on LM Studio. A common way of testing these models is to gather a private set of questions and check their outputs. With this strategy, I found that Beyonder-4x7B-v3 is quite robust to changes in the user and system prompts compared to other models, including AlphaMonarch-7B. This is pretty cool as it improves the usefulness of the model in general.

FrankenMoEs are a promising but still experimental approach. The trade-offs, like higher VRAM demand and slower inference speeds, can make it challenging to see their advantage over simpler merging techniques like SLERP or DARE TIES. In particular, when you use frankenMoEs with just two experts, they might not perform as well as if you had simply merged the two models. However, frankenMoEs excel in preserving knowledge, which can result in stronger models, as demonstrated by Beyonder-4x7B-v3. With the right hardware, these drawbacks can be effectively mitigated.

In this article, we introduced the Mixture of Experts architecture. Unlike traditional MoEs that are trained from scratch, MergeKit facilitates the creation of MoEs by ensembling experts, offering an innovative approach to improving model performance and efficiency. We detailed the process of creating a frankenMoE with MergeKit, highlighting the practical steps involved in selecting and combining different experts to produce a high-quality MoE.

Thanks for reading this article. I encourage you to try to make your own frankenMoEs using LazyMergekit: select a few models, create your config based on Beyonder's, and run the notebook to create your own models! If you liked this article, please follow me on Hugging Face and X/Twitter @maximelabonne.

Text Embeddings, Classification, and Semantic Search | by Shaw Talebi – Towards Data Science

Imports

We start by importing dependencies and the synthetic dataset.

Next, we'll generate the text embeddings. Instead of using the OpenAI API, we will use an open-source model from the Sentence Transformers Python library. This model was specifically fine-tuned for semantic search.

To see the different resumes in the dataset and their relative locations in concept space, we can use PCA to reduce the dimensionality of the embedding vectors and visualize the data on a 2D plot (code is on GitHub).

From this view we see the resumes for a given role tend to clump together.

Now, to do a semantic search over these resumes, we can take a user query, translate it into a text embedding, and then return the nearest resumes in the embedding space. Here's what that looks like in code.
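As a rough, library-free illustration of the nearest-neighbor step, the sketch below ranks documents by cosine similarity to a query vector. It is not the article's code: the three-dimensional vectors are hand-written toy values standing in for real embeddings, cosine similarity is an assumed choice of metric, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return [(doc_index, similarity)] for the top_k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), i)
              for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:top_k]]

# Toy "embeddings" (made up); a real pipeline would get these from a model.
resume_roles = ["Data Engineer", "Data Scientist", "Data Engineer", "ML Engineer"]
resume_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
    [0.2, 0.5, 0.8],
]
query_vec = [1.0, 0.1, 0.0]  # pretend embedding of a data engineering query

results = semantic_search(query_vec, resume_vecs, top_k=2)
for index, score in results:
    print(resume_roles[index], round(score, 3))
```

With real embeddings the only change is where the vectors come from; the ranking step stays the same, although large collections usually call for a vector index rather than a full scan.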

Printing the roles of the top 10 results, we see almost all are data engineers, which is a good sign.

Let's look at the resume of the top search result.

Although this is a made-up resume, the candidate likely has all the necessary skills and experience to fulfill the user's needs.

Another way to look at the search results is via the 2D plot from before. Here's what that looks like for a few queries (see plot titles).

While this simple search example does a good job of matching particular candidates to a given query, it is not perfect. One shortcoming is when the user query includes a specific skill. For example, in the query "Data Engineer with Apache Airflow experience," only 1 of the top 5 results has Airflow experience.

This highlights that semantic search is not better than keyword-based search in all situations. Each has its strengths and weaknesses.

Thus, a robust search system will employ so-called hybrid search, which combines the best of both techniques. While there are many ways to design such a system, a simple approach is applying keyword-based search to filter down results, followed by semantic search.
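A minimal sketch of the filter-then-rank approach is shown below. The documents, their two-dimensional vectors, and the `hybrid_search` helper are all hypothetical, and the keyword step is a plain substring match rather than a proper keyword search engine.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hybrid_search(query_vec, required_keywords, docs, top_k=5):
    """Keep docs containing every keyword, then rank survivors semantically.

    docs is a list of (text, embedding) pairs; returns document indices.
    """
    candidates = [
        (i, vec)
        for i, (text, vec) in enumerate(docs)
        if all(kw.lower() in text.lower() for kw in required_keywords)
    ]
    ranked = sorted(candidates,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [i for i, _ in ranked[:top_k]]

# Toy corpus with made-up embeddings.
docs = [
    ("Data engineer; built pipelines with Apache Airflow and Spark", [0.9, 0.1]),
    ("Data engineer; SQL and dbt", [0.95, 0.05]),
    ("Data scientist; used Airflow for model retraining", [0.3, 0.9]),
]
hits = hybrid_search([1.0, 0.0], ["airflow"], docs, top_k=2)
print(hits)
```

Here the second resume is the closest match semantically but is filtered out because it never mentions Airflow, which is the behavior the keyword stage is meant to enforce.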

Two additional strategies for improving search are using a Reranker and fine-tuning text embeddings.

A Reranker is a model that directly compares two pieces of text. In other words, instead of computing the similarity between pieces of text via a distance metric in the embedding space, a Reranker computes such a similarity score directly.

Rerankers are commonly used to refine search results. For example, one can return the top 25 results using semantic search and then refine to the top 5 with a Reranker.
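The retrieve-then-rerank pattern can be sketched as follows. The `token_overlap` scorer is only a stand-in so the example runs without a model; an actual Reranker would be a trained model that scores each (query, document) pair. The documents and first-stage scores are made up.

```python
def token_overlap(query, doc):
    # Stand-in pair scorer (Jaccard overlap of words), NOT a real Reranker.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    union = q_tokens | d_tokens
    return len(q_tokens & d_tokens) / len(union) if union else 0.0

def retrieve_then_rerank(query, docs, first_stage_scores, pair_scorer,
                         n_candidates=25, n_final=5):
    """Take the best n_candidates by first-stage score, rerank, keep n_final."""
    candidates = sorted(range(len(docs)),
                        key=lambda i: first_stage_scores[i],
                        reverse=True)[:n_candidates]
    reranked = sorted(candidates,
                      key=lambda i: pair_scorer(query, docs[i]),
                      reverse=True)
    return reranked[:n_final]

docs = [
    "data engineer with airflow experience",
    "data engineer with spark experience",
    "marketing manager",
    "data scientist",
]
semantic_scores = [0.80, 0.85, 0.10, 0.60]  # pretend embedding similarities
top = retrieve_then_rerank("data engineer airflow", docs, semantic_scores,
                           token_overlap, n_candidates=3, n_final=2)
print(top)
```

The first stage is cheap and runs over the whole collection; the pair scorer is more expensive, so it only sees the short candidate list.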

Fine-tuning text embeddings involves adapting an embedding model for a particular domain. This is a powerful approach because most embedding models are based on a broad collection of text and knowledge. Thus, they may not optimally organize concepts for a specific industry, e.g. data science and AI.

Although everyone seems focused on the potential for AI agents and assistants, recent innovations in text-embedding models have unlocked countless opportunities for simple yet high-value ML use cases.

Here, we reviewed two widely applicable use cases: text classification and semantic search. Text embeddings enable simpler and cheaper alternatives to LLM-based methods while still capturing much of the value.

Transform Accelerator Announces Data Science and AI Startups Selected for Cohort 3 – Polsky Center for … – Polsky Center for Entrepreneurship and…

Published on Thursday, March 21, 2024

Following the success of its first and second cohorts, Transform adds seven new early-stage companies utilizing advances in data science and AI.

The University of Chicago's Polsky Center for Entrepreneurship and Innovation and Data Science Institute today announced the seven early-stage companies accepted into the third cohort of the Transform accelerator for data science and AI startups.

Powered by the Polsky Center's Deep Tech Ventures, Transform provides full-spectrum support for the startups accepted into the accelerator, including access to business and technical training, industry mentorship, venture capital connections, and funding opportunities.

The seven startups will receive approximately $250,000 in total investment, including $25,000 in funding, credits for Google for Startups, workspace in the Polsky Exchange on Chicago's South Side, and access to industry mentors, technical advisors and student talent from the University of Chicago Department of Computer Science, Data Science Institute (DSI), and the Chicago Booth School of Business.

Transform Cohort 3:

"I am excited to welcome cohort three into Transform. This cycle was particularly competitive and we are delighted with the seven companies we selected," said Shyama Majumdar, director of Transform. "We have a good mix of healthcare, construction, manufacturing, and fintech companies represented as we continue to see generative AI startups leading the way, which is reflected in this cohort. After the success of cohort 2, we are ready to run with cohort 3 and help pave their way to success."

The accelerator launched in Spring 2023 with its inaugural cohort and those startups are already seeing success. Echo Labs, a transcription platform in the previous cohort, has scaled up, hiring software engineers to meet the demand of partnerships with 150 universities to pilot their product. Blackcurrant, an online business-to-business marketplace for buying and selling hydrogen and member of the first cohort, recently was awarded a $250,000 co-investment from the George Shultz Innovation Fund after participating in the program.

"The continued success of Transform startups has been very encouraging," said David Uminsky, executive director of the Data Science Institute. "The wide range of sectors this new cohort serves demonstrates AI's increasing impact on business."

Transform is partly supported by corporate partners McAndrews, Held & Malloy, Ltd and venture partner, True Blue Partners, a Silicon Valley-based venture capital firm investing in early-stage AI companies, founded by Chicago Booth alum Sunil Grover, MBA '99.

"Transform is providing the fertile ground necessary to help incubate the next generation of market leaders," said Grover, who also is a former engineer with nearly two decades of experience helping build companies as an entrepreneur, investor, and advisor. "Advancements in deep tech present a unique interdisciplinary opportunity to re-imagine every aspect of the business world. This, I believe, will lead to creation of innovative new businesses that are re-imagined, ground up, to apply the capabilities these new technologies can enable."

Digital Transformation in Finance: Challenges and Benefits – Data Science Central

Digital transformation is no longer a choice but a necessity for financial institutions looking to stay competitive in the modern business world. From improving client experience to increasing operational efficiency and enhancing security, the benefits in finance are numerous. Still, with benefits come challenges and risks that must be addressed to ensure successful implementation. In this article, we will discuss the advantages and challenges of digital transformation in the finance sector, as well as successful examples of companies that have used it to their advantage.

Digital transformation in finance is the process of implementing advanced digital technologies to improve financial processes, services, and client experiences. It involves integrating technologies such as big data analytics, cloud computing, artificial intelligence, blockchain, and robotic process automation to automate and streamline financial operations. This process aims to increase efficiency, reduce costs, mitigate risks, and provide more personalized services to clients. By using digital technologies, financial institutions can gain a competitive advantage in the market and stay ahead of rapidly evolving client needs and preferences.

The finance industry has traditionally been slow to adopt new technologies; however, the arrival of new technologies has made it important for financial institutions to embrace transformation. Digital transformation enables financial institutions to offer personalized services, reduce costs, increase efficiency, mitigate risks, and improve client experiences. By embracing it, financial institutions can leverage data and analytics to make more informed decisions and enhance their operations. Digital transformation can also help financial institutions stay ahead of the competition by enabling them to create new products and services that cater to the evolving needs of their clients. Thus, digital transformation is crucial for financial institutions to stay relevant and thrive in today's competitive landscape.

Digital transformation is reshaping the financial industry, providing numerous benefits to both financial institutions and their clients. In this section, we will explore some of its key benefits in finance, including enhanced client experience, increased efficiency, improved data analysis, enhanced security, and competitive advantage.

Digital transformation enhances client experience: financial institutions can provide personalized services and improve accessibility through different digital channels. This can lead to increased client satisfaction and loyalty.

Digital transformation can help financial institutions automate and streamline different processes, leading to cost savings, faster turnaround times, and improved accuracy.

It enables financial institutions to leverage advanced analytics tools and algorithms to make more informed decisions and identify new business opportunities.

Digital transformation can improve security through advanced cybersecurity measures such as encryption, biometric authentication, and real-time monitoring. This can protect financial institutions from cyber threats and ensure the safety of client data.

It can also provide financial institutions with a competitive advantage by enabling them to create new products and services that cater to the evolving needs of their clients. Financial institutions that adopt digital transformation are able to stay ahead of the competition and stay relevant in today's digital era.

Digital transformation in finance is revolutionizing the financial sector, with a broad range of impacts affecting businesses and customers alike. From the disruption of traditional business models to increased competition and greater personalization, the benefits and challenges of this transformation are far-reaching. In this section, we'll explore the major ways in which digital transformation is impacting the financial industry.

It's disrupting traditional business models in the financial industry by creating new ways of delivering financial services, such as peer-to-peer lending, robo-advisory services, and mobile payments. As a result, traditional financial institutions are facing intense competition from digital-only startups and fintech companies that are more adaptable and agile.

Digital transformation has significantly increased competition in the financial industry, as clients now have access to a wider range of financial services and providers. This has forced traditional financial institutions to improve their services, reduce costs, and innovate to remain competitive.

It has enabled financial institutions to automate and streamline different processes, resulting in faster turnaround times, reduced costs, and enhanced accuracy. For example, digital processes can help financial institutions handle client onboarding and loan processing more efficiently.

It has also enabled personalized services based on client experiences and preferences, leading to increased client satisfaction and loyalty. By using data analytics, financial institutions can offer personalized investment advice and customized product recommendations.

Digital transformation in finance has made financial services more accessible and convenient for clients, who can now access their accounts and conduct transactions through multiple digital channels, such as mobile apps, online apps, and chatbots.

It has also brought new security risks to the financial industry, as financial transactions and client data are highly exposed to cyber threats. Financial institutions must implement robust security measures to protect themselves and their clients from potential cyberattacks.

Digital transformation in finance isn't without its challenges and risks. In this section, we will explore some of the common obstacles that financial institutions face when undergoing this process.

One of the common challenges in digital transformation is resistance to change from employees and clients. It isn't easy to introduce new technologies and processes, and some individuals may feel uncomfortable or threatened by the changes. Proper communication and training are necessary to ensure a smooth transition.

The adoption of new technologies may require the replacement or integration of legacy systems and processes. These systems can be outdated and incompatible with modern tools, which can create obstacles and delays in digital transformation. Upgrading legacy systems and processes can be expensive and time-consuming, but it's necessary to ensure a smooth transition.

Digital transformation generates an enormous amount of data, and managing that data can be a significant challenge for financial institutions. Data management includes collecting, processing, storing, and analyzing data, which can be time-consuming and require significant resources. Effective data management is essential to realize the full benefits of digital transformation.

This process introduces new cybersecurity risks, including data breaches, phishing attacks, and ransomware. Financial institutions must take adequate measures to protect themselves and their clients from these threats. This includes implementing strong cybersecurity policies, training employees on best practices, and investing in cybersecurity technologies.

Digital transformation has become a necessity for financial institutions to remain competitive in today's market. While there are challenges and risks associated with digital transformation, the benefits are numerous, including enhanced client experience, increased efficiency, and improved data analysis. Successful examples such as JPMorgan Chase, Ally Financial, Capital One, Goldman Sachs, and Mastercard show how digital transformation can lead to improved business outcomes.

With the right strategy and implementation approach, financial institutions can navigate the challenges and reap the rewards of digital transformation. At Aeologic Technologies, we strive to provide innovative solutions that enable financial institutions to achieve their digital transformation objectives and stay ahead of the curve.

Cracking the Apache Spark Interview: 80+ Top Questions and Answers for 2024 – Simplilearn

Apache Spark is a unified analytics engine for processing large volumes of data. It can run some workloads up to 100 times faster than Hadoop MapReduce and offers over 80 high-level operators that make it easy to build parallel apps. Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access data from multiple sources.

This article covers the most important Apache Spark interview questions that you might face in a Spark interview. The questions are grouped into sections based on the various components of Apache Spark, and after going through this article you should be able to answer most of the questions asked in your next Spark interview.

Apache Spark | MapReduce
Spark processes data in batches as well as in real time | MapReduce processes data in batches only
Spark runs almost 100 times faster than Hadoop MapReduce | Hadoop MapReduce is slower when it comes to large-scale data processing
Spark stores data in RAM, i.e. in memory, so it is easier to retrieve | Hadoop MapReduce data is stored in HDFS and hence takes a long time to retrieve
Spark provides caching and in-memory data storage | Hadoop is highly disk-dependent

Apache Spark has 3 main categories that comprise its ecosystem. Those are:

This is one of the most frequently asked Spark interview questions, and the interviewer will expect you to give a thorough answer to it.

Spark applications run as independent processes that are coordinated by the SparkSession object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply operations repeatedly to the data so they can benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or can be saved to the disk.

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark and are part of Spark Core. RDDs are immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel. RDDs are split into partitions, which can be processed on different nodes of a cluster.

RDDs are created by either transformation of existing RDDs or by loading an external dataset from stable storage like HDFS or HBase.
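The two creation paths can be sketched as follows. This is a minimal illustration, assuming a local Spark session. The object name, the `squares` helper, and the HDFS path are hypothetical, and `parallelize` is used only to have some seed data to transform.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddCreation {
  // Derives a new RDD from an existing one; the input RDD is left unchanged.
  def squares(numbers: RDD[Int]): RDD[Int] = numbers.map(n => n * n)

  def main(args: Array[String]): Unit = {
    // Local session with two worker threads; the app name is arbitrary.
    val spark = SparkSession.builder().appName("rdd-creation").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // Seed data for the demo, spread over two partitions.
    val numbers = sc.parallelize(1 to 5, numSlices = 2)
    println(squares(numbers).collect().mkString(", "))

    // Loading from stable storage would look like this (hypothetical path):
    // val lines = sc.textFile("hdfs:///data/input.txt")
    spark.stop()
  }
}
```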


When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed immediately. Transformations are evaluated only when you perform an action, which lets Spark optimize the overall data processing workflow. This is known as lazy evaluation.
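A small sketch can make lazy evaluation visible. It assumes a local Spark session and uses an accumulator to count how often the function passed to map() has actually run. The object and variable names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationDemo {
  // Returns how many times the map function had run
  // before and after the action was called.
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("map calls")

    // map() only records the transformation; nothing executes yet.
    val doubled = sc.parallelize(1 to 100).map { n => calls.add(1); n * 2 }
    val before: Long = calls.value

    // count() is an action, so Spark now runs the recorded map().
    doubled.count()
    val after: Long = calls.value
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-eval").master("local[2]").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

The counter stays at zero until the action runs, which is the behavior the paragraph above describes.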


Apache Spark stores data in memory for faster processing and for building machine learning models. Machine learning algorithms require multiple iterations and different conceptual steps to create an optimal model. Graph algorithms traverse all the nodes and edges of a graph. Low-latency workloads like these, which need multiple iterations over the same data, gain performance from in-memory storage.

To trigger the clean-ups, you need to set the parameter spark.cleaner.ttl.

There are a total of 4 steps that can help you connect Spark to Apache Mesos.

Parquet is a columnar format that is supported by several data processing systems. With the Parquet file, Spark can perform both read and write operations.
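As a rough illustration of the read and write operations, the sketch below writes a tiny DataFrame to Parquet and reads it back. It assumes a local Spark session. The object name, the sample rows, and the temporary output directory are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetRoundTrip {
  // Writes a small DataFrame as Parquet and reads it back from the same path.
  def roundTrip(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._
    val people = Seq(("Ann", 31), ("Bob", 45)).toDF("name", "age")
    people.write.mode("overwrite").parquet(path) // write operation
    spark.read.parquet(path)                     // read operation; the schema comes from the file
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-demo").master("local[2]").getOrCreate()
    val path = Files.createTempDirectory("parquet-demo").resolve("people.parquet").toString
    // Selecting one column lets a columnar format skip the others.
    roundTrip(spark, path).select("name").show()
    spark.stop()
  }
}
```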

Some of the advantages of having a Parquet file are:

Shuffling is the process of redistributing data across partitions that may lead to data movement across the executors. The shuffle operation is implemented differently in Spark compared to Hadoop.

Shuffling has 2 important compression parameters:

spark.shuffle.compress checks whether the engine would compress shuffle outputs or not

spark.shuffle.spill.compress decides whether to compress intermediate shuffle spill files or not

Shuffling occurs when joining two tables or when performing byKey operations such as groupByKey or reduceByKey.
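The sketch below shows a byKey operation that triggers a shuffle, with the two compression parameters set explicitly. It assumes a local Spark session. The object name and sample pairs are made up, and setting both parameters to "true" only restates Spark's defaults.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ShuffleDemo {
  // reduceByKey must bring all values of a key into the same partition,
  // which redistributes data across partitions (a shuffle).
  def totals(pairs: RDD[(String, Int)]): RDD[(String, Int)] = pairs.reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-demo")
      .master("local[2]")
      .config("spark.shuffle.compress", "true")       // compress shuffle outputs
      .config("spark.shuffle.spill.compress", "true") // compress intermediate spill files
      .getOrCreate()

    val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    totals(pairs).collect().foreach(println)
    spark.stop()
  }
}
```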

Spark uses a coalesce method to reduce the number of partitions in a DataFrame.

Suppose you want to read data from a CSV file into an RDD having four partitions.

This is how a filter operation is performed to remove all multiples of 10 from the data.

The RDD has some empty partitions. It makes sense to reduce the number of partitions, which can be achieved by using coalesce.

This is how the resulting RDD would look after applying coalesce.
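The coalesce walkthrough can be approximated in code. This sketch assumes a local Spark session and substitutes an in-memory collection for the CSV file, so the sample values and the object name are illustrative rather than the article's original data.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CoalesceDemo {
  // Drops multiples of 10, then merges what is left into `target` partitions.
  def dropTensAndCoalesce(data: RDD[Int], target: Int): RDD[Int] =
    data.filter(_ % 10 != 0).coalesce(target)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-demo").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    // Four partitions of two elements each; two of them hold only multiples of 10.
    val data = sc.parallelize(Seq(10, 20, 3, 4, 50, 60, 7, 8), numSlices = 4)

    // glom() turns each partition into an array so its contents can be printed.
    // After the filter, two of the four partitions are empty.
    data.filter(_ % 10 != 0).glom().collect()
      .foreach(p => println(p.mkString("[", ", ", "]")))

    val merged = dropTensAndCoalesce(data, target = 2)
    println(s"partitions after coalesce: ${merged.getNumPartitions}")
    spark.stop()
  }
}
```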

Consider the following cluster information:

Here is how to identify the number of cores:

To calculate the number of executors:

Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:

There are 2 ways to convert a Spark RDD into a DataFrame:

import com.mapr.db.spark.sql._

val df = sc.loadFromMapRDB()
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()

You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object:

def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
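A minimal sketch of the createDataFrame route, assuming a local Spark session. The object name, column names, and sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RddToDataFrame {
  def build(spark: SparkSession): DataFrame = {
    // An RDD of generic Row objects.
    val rows = spark.sparkContext.parallelize(Seq(Row("Peter", 34), Row("Maria", 29)))

    // A schema whose fields match the structure of each Row.
    val schema = StructType(Seq(
      StructField("first_name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)
    ))

    // Apply the schema to the RDD of Rows.
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-df").master("local[2]").getOrCreate()
    build(spark).show()
    spark.stop()
  }
}
```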

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type. An RDD holds data spread across various nodes and can recover that data after faults.

The Resilient Distributed Dataset (RDD) in Spark supports two types of operations: transformations and actions.

A transformation generates a new RDD from existing RDDs. Each transformation takes an existing RDD as input and produces one or more RDDs as output. Because RDDs are immutable, the input RDDs do not change.

Applying transformations also builds the RDD lineage, which includes all parent RDDs of the final RDD. This lineage is also called the RDD operator graph or RDD dependency graph. It is a logical execution plan: a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.

An RDD action works on the actual dataset. Unlike a transformation, an action does not produce a new RDD; actions are RDD operations that return non-RDD values. These values are returned to the driver or written to external storage systems. An action is what sets the recorded transformations in motion.

An action is how data is sent from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and task execution.

This is another frequently asked Spark interview question. A lineage graph is a graph of the dependencies between existing RDDs and new RDDs. All the dependencies between RDDs are recorded in a graph, rather than the original data.

An RDD lineage graph is needed when we want to compute a new RDD or recover lost data from a lost persisted RDD. By default, Spark does not replicate RDD data in memory, so if any data is lost, it is rebuilt using the RDD lineage. The lineage graph is also called an RDD operator graph or RDD dependency graph.
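To inspect a lineage graph, one option is RDD.toDebugString. The sketch below assumes a local Spark session. The object name and the small pipeline are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LineageDemo {
  def lineage(spark: SparkSession): String = {
    val base = spark.sparkContext.parallelize(1 to 10, 2)
    val result = base.map(_ * 2).filter(_ > 5)

    // toDebugString describes the chain of parent RDDs that Spark
    // would replay to rebuild lost partitions of `result`.
    result.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-demo").master("local[2]").getOrCreate()
    println(lineage(spark))
    spark.stop()
  }
}
```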

A Discretized Stream (DStream) is a continuous sequence of RDDs and the basic abstraction in Spark Streaming. The RDDs in the sequence are all of the same type and together represent a continuous stream of data. Every RDD contains data from a specific interval.

DStreams in Spark take input from many sources such as Kafka, Flume, Kinesis, or TCP sockets. A DStream can also be generated by transforming another input stream. DStreams give developers a high-level API and fault tolerance.

Caching, also known as persistence, is an optimization technique for Spark computations. Similar to RDDs, DStreams also allow developers to persist the stream's data in memory. That is, using the persist() method on a DStream will automatically persist every RDD of that DStream in memory. It helps to save interim partial results so they can be reused in subsequent stages.

For input streams that receive data over the network, the default persistence level is set to replicate the data to two nodes for fault tolerance.

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication costs.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value

res0: Array[Int] = Array(1, 2, 3)


Moving forward, let us look at Spark interview questions for experienced candidates.

A DataFrame can be created programmatically in three steps:

a) map(func)

b) transform(func)

c) filter(func)

d) count()

The correct answer is c) filter(func).

This is one of the most frequently asked Spark interview questions, where the interviewer expects a detailed answer (and not just a yes or no!). Give as detailed an answer as possible here.

Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save the data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and resume from wherever it stopped.

There are 2 types of data for which we can use checkpointing in Spark.

Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.

Data Checkpointing: Here, we save the RDD to reliable storage because its need arises in some of the stateful transformations. In this case, the upcoming RDD depends on the RDDs of previous batches.
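As a batch-mode analogue of data checkpointing, the sketch below checkpoints a single RDD with SparkContext.setCheckpointDir and RDD.checkpoint. It assumes a local Spark session and a temporary local directory; a streaming application would instead set a checkpoint directory on its StreamingContext, which is not shown here. The object name and sample data are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CheckpointDemo {
  // Returns true once the RDD has been saved to the checkpoint directory.
  def run(spark: SparkSession, checkpointDir: String): Boolean = {
    val sc = spark.sparkContext
    sc.setCheckpointDir(checkpointDir) // use fault-tolerant storage such as HDFS on a real cluster

    val totals = sc.parallelize(1 to 100, 4).map(n => (n % 10, n)).reduceByKey(_ + _)
    totals.checkpoint() // only marks the RDD; nothing is written yet
    totals.count()      // the first action materializes the checkpoint
    totals.isCheckpointed
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-demo").master("local[2]").getOrCreate()
    val dir = Files.createTempDirectory("spark-checkpoints").toString
    println(run(spark, dir))
    spark.stop()
  }
}
```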

In networking, a sliding window controls the transmission of data packets between computers. In Spark, the Spark Streaming library provides windowed computations, where the transformations on RDDs are applied over a sliding window of data.
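The windowing idea can be illustrated without a streaming cluster. The sketch below uses plain Scala collections rather than the Spark Streaming API: each element stands for one batch, and the window length and slide are counted in batches. The object and function names are made up for illustration.

```scala
object SlidingWindowDemo {
  // Sums batch values over windows of `length` batches,
  // moving forward `slide` batches at a time.
  def windowedSums(batches: Seq[Int], length: Int, slide: Int): List[Int] =
    batches.sliding(length, slide).map(_.sum).toList

  def main(args: Array[String]): Unit = {
    // Hypothetical event counts for seven consecutive batches.
    val eventsPerBatch = Seq(1, 2, 3, 4, 5, 6, 7)
    // Windows are (1,2,3), (3,4,5), (5,6,7): consecutive windows share one batch.
    println(windowedSums(eventsPerBatch, length = 3, slide = 2))
  }
}
```

Because the slide is shorter than the window, neighboring windows overlap, so each batch can contribute to more than one result.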

DISK_ONLY - Stores the RDD partitions only on disk

MEMORY_ONLY_SER - Stores the RDD as serialized Java objects, with one byte array per partition

MEMORY_ONLY - Stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in the available memory, some partitions won't be cached

OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory

MEMORY_AND_DISK - Stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that don't fit are stored on disk

MEMORY_AND_DISK_SER - Like MEMORY_ONLY_SER, but partitions that don't fit in memory are spilled to disk
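Choosing one of these levels in code might look like the sketch below. It assumes a local Spark session. The object name and data are hypothetical, and MEMORY_AND_DISK is picked only as an example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  // Persists an RDD, reuses it in two actions, and returns the level in effect.
  def run(spark: SparkSession): StorageLevel = {
    val doubled = spark.sparkContext.parallelize(1 to 1000).map(_ * 2)

    doubled.persist(StorageLevel.MEMORY_AND_DISK)
    doubled.count()       // first action computes and stores the partitions
    doubled.reduce(_ + _) // second action reads the stored partitions

    val level = doubled.getStorageLevel
    doubled.unpersist()   // release the storage once it is no longer needed
    level
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-demo").master("local[2]").getOrCreate()
    println(run(spark).description)
    spark.stop()
  }
}
```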

See the rest here:

Cracking the Apache Spark Interview: 80+ Top Questions and Answers for 2024 - Simplilearn

Adding Temporal Resiliency to Data Science Applications | by Rohit Pandey | Mar, 2024 – Towards Data Science

Image by midjourney

Modern applications almost exclusively store their state in databases and also read any state they require to perform their tasks from databases. We'll concern ourselves with adding resilience to the processes of reading from and writing to these databases, making them highly reliable.

The obvious way to do this is to improve the quality of the hardware and software comprising the database so our reads and writes never fail. But this runs into diminishing returns: once we're already at high availability, pouring more money in moves the needle only marginally. Adding redundancy to achieve high availability quickly becomes a much better strategy.

So, what does high reliability through added redundancy look like? We remove single points of failure by spending more money on redundant systems. For example, we can maintain redundant copies of the data so that if one copy gets corrupted or damaged, the others can be used to repair it. Another example is having a redundant database that can be read from and written to when the primary one is unavailable. We'll call these kinds of solutions, where additional memory, disk space, hardware, or other physical resources are allotted to ensure high availability, spatial redundancy. But can we get high reliability (going beyond the characteristics of the underlying databases and other components) without spending any additional money? That's where the idea of temporal redundancy comes in.

All images in this article unless otherwise specified are by the author.

If spatial redundancy is running with redundant infrastructure, then temporal redundancy is running more with existing infrastructure.

Temporal redundancy is typically much cheaper than spatial redundancy. It can also be easier to implement.
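One common way to "run more with existing infrastructure" is to retry a failed read or write after a delay. The sketch below is an assumption about what such a scheme could look like, not the author's implementation. The `retry` helper, the flaky `read` function, and the delay values are all hypothetical.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Try}

object TemporalRetry {
  // Runs `op`; on failure waits `delayMillis`, then tries again with a doubled
  // delay, for at most `maxAttempts` attempts in total.
  @tailrec
  def retry[A](maxAttempts: Int, delayMillis: Long)(op: => A): Try[A] =
    Try(op) match {
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(delayMillis)
        retry(maxAttempts - 1, delayMillis * 2)(op)
      case result => result
    }

  def main(args: Array[String]): Unit = {
    var calls = 0
    // Hypothetical flaky read: the database is unavailable for the first two calls.
    def read(): String = {
      calls += 1
      if (calls < 3) throw new RuntimeException("database unavailable")
      "row-42"
    }
    println(retry(maxAttempts = 5, delayMillis = 10)(read()))
  }
}
```

No extra hardware is involved: the same database is simply asked again later, trading time for reliability.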

The idea is that when reliability-compromising events happen to our applications and databases, they tend to be restricted to certain windows in time. If the

Read more from the original source:

Adding Temporal Resiliency to Data Science Applications | by Rohit Pandey | Mar, 2024 - Towards Data Science

FrugalGPT and Reducing LLM Operating Costs | by Matthew Gunton | Mar, 2024 – Towards Data Science

There are multiple ways to determine the cost of running an LLM (electricity use, compute cost, etc.). However, if you use a third-party LLM (an LLM-as-a-service), the vendor typically charges you based on the tokens you use. Different vendors (OpenAI, Anthropic, Cohere, etc.) count tokens differently, but for the sake of simplicity, we'll consider the cost to be based on the number of tokens processed by the LLM.

The most important part of this framework is the idea that different models cost different amounts. The authors of the paper conveniently assembled the below table highlighting the difference in cost, and the difference between them is significant. For example, AI21's output tokens cost an order of magnitude more than GPT-4's do in this table!

As part of cost optimization, we always need to find a way to maximize answer quality while minimizing cost. Typically, higher-cost models are higher-performing models, able to give higher-quality answers than lower-cost ones. The general relationship can be seen in the graph below, with FrugalGPT's performance overlaid on top in red.

Using the vast cost difference between models, the researchers' FrugalGPT system relies on a cascade of LLMs to give the user an answer. Put simply, the user query begins with the cheapest LLM, and if the answer is good enough, then it is returned. However, if the answer is not good enough, then the query is passed along to the next cheapest LLM.

The researchers used the following logic: if a less expensive model answers a question incorrectly, then it is likely that a more expensive model will give the answer correctly. Thus, to minimize costs the chain is ordered from least expensive to most expensive, assuming that quality goes up as you get more expensive.

This setup relies on reliably determining when an answer is good enough and when it isn't. To solve for this, the authors created a DistilBERT model that takes the question and answer and assigns a score to the answer. As the DistilBERT model is far smaller than the other models in the sequence, the cost to run it is almost negligible compared to the others.
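The cascade logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the models are stub functions, the scorer is a toy stand-in for the DistilBERT scorer, the costs and threshold are invented, and returning the last model's answer when nothing passes the threshold is a choice made for this sketch.

```scala
object LlmCascade {
  // A model is a name, a cost per call, and a function from query to answer (stubs here).
  final case class Model(name: String, costPerCall: Double, answer: String => String)
  final case class Outcome(answer: String, model: String, totalCost: Double)

  // `models` must be ordered from cheapest to most expensive.
  // `score` rates a (query, answer) pair; answers at or above `threshold` are accepted.
  def ask(models: List[Model], score: (String, String) => Double, threshold: Double)
         (query: String): Option[Outcome] = {
    @annotation.tailrec
    def loop(remaining: List[Model], spent: Double): Option[Outcome] = remaining match {
      case Nil => None
      case model :: rest =>
        val answer = model.answer(query)
        val cost = spent + model.costPerCall
        // Accept a good-enough answer, or fall back to the last model's answer.
        if (score(query, answer) >= threshold || rest.isEmpty)
          Some(Outcome(answer, model.name, cost))
        else loop(rest, cost)
    }
    loop(models, 0.0)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical models: the cheap one only handles short queries.
    val cheap = Model("cheap", 1.0, q => if (q.length < 20) "short answer" else "")
    val strong = Model("strong", 30.0, _ => "detailed answer")
    // Toy scorer: any non-empty answer counts as good enough.
    val score = (_: String, a: String) => if (a.nonEmpty) 1.0 else 0.0

    println(ask(List(cheap, strong), score, 0.5)("What is an RDD?"))
    println(ask(List(cheap, strong), score, 0.5)("Explain shuffle spill compression in depth"))
  }
}
```

The total cost accumulates across the chain, so a query that escalates pays for every model it passed through, which is why cheap models that answer most queries well matter so much.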

One might naturally ask, if quality is most important, why not just query the best LLM and work on ways to reduce the cost of running the best LLM?

When this paper came out, GPT-4 was the best LLM they found, yet GPT-4 did not always give a better answer than the FrugalGPT system! (Eagle-eyed readers will see this as part of the cost vs performance graph from before.) The authors speculate that just as the most capable person doesn't always give the right answer, the most complex model won't either. Thus, by having the answer go through a filtering process with DistilBERT, you are removing any answers that aren't up to par and increasing the odds of a good answer.

Consequently, this system not only reduces your costs but can also increase quality more so than just using the best LLM!

The results of this paper are fascinating to consider. For me, it raises questions about how we can go even further with cost savings without having to invest in further model optimization.

One such possibility is to cache all model answers in a vector database and then do a similarity search to determine if the answer in the cache works before starting the LLM cascade. This would significantly reduce costs by replacing a costly LLM operation with a comparatively less expensive query and similarity operation.
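A toy version of this caching idea is sketched below. It is an assumption-laden illustration: word-set overlap (Jaccard similarity) stands in for the embedding similarity a vector database would compute, an in-memory buffer stands in for the database, and the class name, method names, and threshold are invented.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a vector database of previously answered queries.
final class AnswerCache(minSimilarity: Double) {
  private val entries = ArrayBuffer.empty[(String, String)] // (query, answer)

  private def words(s: String): Set[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  // Jaccard similarity of the two queries' word sets, between 0.0 and 1.0.
  def similarity(a: String, b: String): Double = {
    val (x, y) = (words(a), words(b))
    if (x.isEmpty || y.isEmpty) 0.0 else (x & y).size.toDouble / (x | y).size
  }

  // Best cached answer whose query is similar enough, if any.
  def lookup(query: String): Option[String] = {
    val scored = entries.map { case (q, a) => (similarity(q, query), a) }
    if (scored.isEmpty) None
    else Some(scored.maxBy(_._1)).filter(_._1 >= minSimilarity).map(_._2)
  }

  // Serve from the cache when possible; otherwise run the cascade and remember the result.
  def getOrCompute(query: String)(cascade: String => String): String =
    lookup(query).getOrElse {
      val answer = cascade(query)
      entries += ((query, answer))
      answer
    }
}

object AnswerCacheDemo {
  def main(args: Array[String]): Unit = {
    val cache = new AnswerCache(minSimilarity = 0.8)
    var llmCalls = 0
    val cascade = (q: String) => { llmCalls += 1; s"answer to: $q" } // hypothetical LLM cascade
    cache.getOrCompute("What is Apache Spark?")(cascade)
    cache.getOrCompute("what is apache spark")(cascade) // similar enough: served from the cache
    println(s"LLM cascade calls: $llmCalls")
  }
}
```

The similarity threshold is the sensitive design choice here: set too low, the cache returns answers to questions that only look alike; set too high, it rarely hits and saves little.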

Additionally, it makes you wonder whether outdated models can still be worth cost-optimizing: if you can reduce their cost per token, they can still create value in the LLM cascade. Similarly, a key question is at what point adding new LLMs to the chain yields diminishing returns.

Read more:

FrugalGPT and Reducing LLM Operating Costs | by Matthew Gunton | Mar, 2024 - Towards Data Science

Claude 3 vs ChatGPT: Here is How to Find the Best in Data Science – DataDrivenInvestor


The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.

Edsger W. Dijkstra

Echoing Dijkstra's insight, we look at the capabilities of two LLMs, contrasting their prowess in the arena of data science.

Here are the prompts we'll use to compare them:

Claude 3, developed by ex-OpenAI employees and supported by a $2 billion investment from Google in October, has quickly gained fame for its exceptional reasoning abilities.

You can use it here: https://claude.ai/login?returnTo=%2F But let's first see how it compares with other well-known LLMs.

Here is a technical comparison of Claude 3, GPT, and Gemini (formerly Bard, Google's LLM).

You can see the image below, which compares Claude 3 Sonnet (free), Claude 3 Opus, and Haiku (the paid version of Claude 3) with Gemini 1.0 Ultra (the free version) and Gemini 1.0 Pro (the paid version).

Here is the original post:

Claude 3 vs ChatGPT: Here is How to Find the Best in Data Science - DataDrivenInvestor