
SADA Achieves Over 300% Increase in Generative AI and Machine Learning Projects in 2023 – GlobeNewswire

LOS ANGELES, March 26, 2024 (GLOBE NEWSWIRE) -- SADA, An Insight company, a leading business and technology consultancy and award-winning Google Cloud Premier Partner across several product and engagement models, announces continued momentum powered by rapid scale in its generative AI (GenAI) and machine learning operations, as customers transform with Google Cloud's Gemini and Vertex AI platforms, driving a significant increase in customer adoption of these GenAI technologies.

"Our continued growth is a testament to SADA's commitment to delivering business outcomes and driving value to customers by providing industry-leading solutions and services," said Tony Safoian, CEO of SADA. "We strive to be true business partners to our customers as they modernize, innovate, and grow their business with us and Google Cloud."

SADA Drives Customer Value

Generative AI & Machine Learning Scale

Solution & Services Innovation

Google Cloud Next 2024

SADA's team of experts will participate in a series of speaking sessions at Google Cloud Next in Las Vegas on April 9-11, alongside key customers and partners, sharing their cloud innovations and successes.

SADA's sessions include:

Additionally, SADA's ongoing cross-country Cloud Transformation Tour in North America brings Google Cloud experts to discuss machine learning and GenAI for business insights.

Expanding Partner Ecosystem

SADA's partnerships continued with a commitment to the ISV partner ecosystem, which allows SADA to deliver deep expertise and offer complementary solutions for Google Cloud along with add-on solutions. Working together, SADA's partners help customers achieve maximum impact on their business goals.

About SADA

SADA, An Insight company, is a market leader in professional services and an award-winning solutions provider of Google Cloud. Since 2000, SADA has been committed to helping customers in healthcare, media, entertainment, retail, manufacturing, and the public sector solve their most complex challenges so they can focus on achieving their boldest ambitions. With offices in North America, India, and Armenia providing sales and customer support teams, SADA is positioned to meet customers where they are in their digital transformation journey. SADA is a 6x Google Cloud Partner of the Year award winner with 10 Google Cloud Specializations and has been recognized as a Niche Player in the 2023 Gartner Magic Quadrant for Public Cloud IT Transformation Services. SADA is a 15x honoree of the Inc. 5000 list of America's Fastest-Growing Private Companies and has been named to Inc. Magazine's Best Workplaces four years in a row. Learn more at http://www.sada.com.


How AI Bias Is Impacting Healthcare – InformationWeek

Artificial intelligence has been used to spot bias in healthcare, such as a lack of darker skin tones in dermatologic educational materials, but AI has been the cause of bias itself in some cases.

When AI bias occurs in healthcare, the causes are a mix of technical errors as well as real human decisions, according to Dr. Marshall Chin, professor of healthcare ethics in the Department of Medicine at the University of Chicago. Chin co-chaired a recent government panel on AI bias.

"This is something that we have control over," Chin tells InformationWeek. "It's not just a technical thing that is inevitable."

In 2023, a class action lawsuit accused UnitedHealth of illegally using an AI algorithm to turn away seriously ill elderly patients from care under Medicare Advantage. The lawsuit blamed naviHealth's nH Predict AI model for inaccuracy. UnitedHealth told Stat News last year that the naviHealth care-support tool is not used to make determinations. "The lawsuit has no merit, and we will defend ourselves vigorously," the company stated.

Other cases of potential AI bias involved algorithms studying cases of heart failure, cardiac surgery, and vaginal birth after cesarean delivery (VBAC), in which an AI algorithm led Black patients to get more cesarean procedures than were necessary, according to Chin. The algorithm erroneously predicted that minorities were less likely to have success with a vaginal birth after a C-section compared with non-Hispanic white women, according to the US Department of Health and Human Services Office of Minority Health. "It inappropriately had more of the racial minority patients having severe cesarean sections as opposed to having the vaginal birth," Chin explains. "It basically led to an erroneous clinical decision that wasn't supported by the actual evidence base."


After years of research, the VBAC algorithm was changed to no longer consider race or ethnicity when predicting which patients could suffer complications from a VBAC procedure, HHS reported.

"When a dataset used to train an AI system lacks diversity, that can result in misdiagnoses, disparities in healthcare, and unequal insurance decisions on premiums or coverage," explains Tom Hittinger, healthcare applied AI leader at Deloitte Consulting.

"If a dataset used to train an AI system lacks diversity, the AI may develop biased algorithms that perform well for certain demographic groups while failing others," Hittinger says in an email interview. "This can exacerbate existing health inequities, leading to poor health outcomes for underrepresented groups."


Although AI tools can cause bias, they also bring more diversity to drug development. Companies such as BioPhy study patterns in patient populations to see how people respond to different types of drugs.

The challenge is to choose a patient population that is broad enough to offer a level of diversity but still demonstrates drug efficacy. However, designing an AI algorithm to predict patient populations may result in only a subset of the population, explains Dave Latshaw II, PhD, cofounder of BioPhy.

"If you feed an algorithm that's designed to predict optimal patient populations with only a subset of the population, then it's going to give you an output that only recommends a subset of the population," Latshaw tells InformationWeek. "You end up with bias in those predictions if you act on them when it comes to structuring your clinical trials and finding the right patients to participate."

Therefore, health IT leaders must diversify their training sets when teaching an AI platform to avoid blindness in the results, he adds.

"The dream scenario for somebody who's developing a drug is that they're able to test their drug in nearly any person of any background from any location with any genetic makeup that has a particular disease, and it will work just the same in everyone," Latshaw says. "That's the ideal state of the world."


IT leaders should involve a diverse group of stakeholders when implementing algorithms. That involves tech leaders, clinicians, patients, and the public, Chin says.

When validating AI models, IT leaders should include ethicists and data scientists along with clinicians, patients, and associates, which are nonclinical employees, staff members, and contractual workers at a healthcare organization, Hittinger says. When multiple teams roll out new models, that can increase the time required for experimentation and lead to a gradual rollout along with continuous monitoring, according to Hittinger.

That process can take many months, he says.

Many organizations are using proprietary algorithms, which lack an incentive to be transparent, according to Chin. He suggests that AI algorithms should carry labels, like those on a cereal box, explaining how the algorithm was developed, how patient demographic characteristics were distributed, and which analytical techniques were used.

"That would give people some sense of what this algorithm is, so this is not a total black box," Chin says.

In addition, organizations should audit and monitor AI systems for bias and performance disparities, Hittinger advises.

"Organizations must proactively search for biases within their algorithms and datasets, undertake the necessary corrections, and set up mechanisms to prevent new biases from arising unexpectedly," Hittinger says. "Upon detecting bias, it must be analyzed and then rectified through well-defined procedures aimed at addressing the issue and restoring public confidence."

Organizations such as Deloitte offer frameworks to provide guidance on how to maintain ethical use of AI.

"One core tenet is creating fair, unbiased models, and this means that AI needs to be developed and trained to adhere to equitable, uniform procedures and render impartial decisions," Hittinger says.

In addition, healthcare organizations can adopt automated monitoring tools to spot and fix model drift, according to Hittinger. He also suggests that healthcare organizations form partnerships with academic institutions and AI ethics firms.

Dr. Yair Lewis, chief medical officer at AI-powered primary-care platform Navina, recommends that organizations establish a fairness score metric for algorithms to ensure that patients are treated equally.

"The concept is to analyze the algorithm's performance across different demographics to identify any disparities," Lewis says in an email interview. "By quantifying bias in this manner, organizations can set benchmarks for fairness and monitor improvements over time."
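A fairness score of this kind can be sketched as a disparity in a performance metric across demographic groups. Everything below (the toy labels, the groups, and the max-minus-min disparity definition) is an illustrative assumption, not Navina's actual metric:

```python
# Minimal sketch of a per-group fairness check (illustrative only).
# disparity = max group accuracy - min group accuracy; closer to 0 is fairer.

def group_accuracies(y_true, y_pred, groups):
    """Accuracy of predictions within each demographic group."""
    acc = {}
    for g in set(groups):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        acc[g] = sum(t == p for t, p in pairs) / len(pairs)
    return acc

def fairness_disparity(y_true, y_pred, groups):
    acc = group_accuracies(y_true, y_pred, groups)
    return max(acc.values()) - min(acc.values())

# Toy example: the model is right 3/4 of the time for group A, 1/2 for group B.
y_true = [1, 0, 1, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1]
groups = ["A", "A", "A", "A", "B", "B"]

print(fairness_disparity(y_true, y_pred, groups))  # 0.75 - 0.5 = 0.25
```

Tracking this single number over time gives a benchmark of the kind Lewis describes; in practice the metric would be a clinically relevant one (e.g. sensitivity) rather than raw accuracy.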


Netflix Uses Metaflow to Manage Hundreds of AI/ML Applications at Scale – InfoQ.com

Netflix recently published how its Machine Learning Platform (MLP) team provides an ecosystem around Metaflow, an open-source machine learning infrastructure framework. By creating various integrations for Metaflow, Netflix is able to support hundreds of Metaflow projects maintained by multiple engineering teams.

Metaflow's integrations with Netflix's production systems enable projects to move from prototype to production without incurring unsustainable operational overhead. The engineering team explains their key to success:

Given the very diverse set of ML and AI use cases we support [...], we don't expect all projects to follow the same path from prototype to production. Instead, we provide a robust foundational layer with integrations to our company-wide data, compute, and orchestration platform, as well as various paths to deploy applications to production smoothly. On top of this, teams have built their own domain-specific libraries to support their specific use cases and needs.

One integration example provided is the "Fast Data" library for Metaflow. Netflix hosts its main data lake on S3 as Apache Iceberg tables and uses Apache Spark for ETL. The Fast Data library enables fast, scalable, and robust access to the Netflix data warehouse by leveraging high-performance components from the Python data ecosystem. This library allows Netflix to process terabytes of data collectively and encode complex relationships between titles, actors, and other film attributes, supporting the company's broad business applications.

The Fast Data Library for Metaflow (Source)

Netflix's production workflow orchestrator, Maestro, plays a critical role in managing Metaflow projects in production. It supports scalability and high availability and enables seamless integration of Metaflow flows with other systems through event triggering. Using this integration, Netflix engineers can support content decision-making and answer "what content Netflix should bring to the service."

Finally, for deployments that require an API and real-time evaluation, Netflix provides an integrated model hosting service, Metaflow Hosting. Metaflow Hosting "provides an easy to use interface on top of Netflix's existing microservice infrastructure, allowing data scientists to quickly move their work from experimentation to a production grade web service that can be consumed over a HTTP REST API with minimal overhead."

Using this integration, Netflix hosts and scales a model that computes various media asset features. Once the features are computed, a consuming service can store them for future use. An earlier talk provides an overview and further details about this service.

Hosting and Consuming the Media Feature Computation Model (Source)

Netflix implemented the integrations using Metaflow's extension mechanism, "which is publicly available but subject to change and hence not part of Metaflow's stable API yet." They invite engineers to contact them on the Metaflow community Slack to discuss building additional extensions.


10 Areas Impacted By AI – SME.org

Few technologies have transformed the manufacturing industry like ERP software. The next transformative technology, artificial intelligence (AI) and machine learning (ML), is taking manufacturing to a new level with unprecedented predictive data tracking and analysis capabilities.

AI can be programmed to learn from large amounts of data to make deeper and more accurate predictions regarding customers, buying habits, inventory levels, markets, material purchasing and more. In turn, AI programs machines to learn from experience so they can perform tasks that have always been conducted by humans.

AI software uses progressive learning algorithms to automate repetitive learning and achieve incredible accuracy so the data can do the programming. It can even assist manufacturers with decision-making when the relevant data, parameters, and variables exceed human understanding.

Integrating AI and ERP software can do what ERP has been doing all along: simplifying manufacturing, improving operational efficiency, and increasing profitability while growing the company. Here are 10 ways AI can make your manufacturing better.

AI-integrated ERP software helps optimize inventory management by predicting demand, identifying slow-moving products, and automating order fulfillment. AI-based inventory planning provides increased visibility of inventory KPIs, improved product, channel, and location forecasting, and automatic SKU classification to meet material demands. A recent McKinsey study reported that companies utilizing AI to optimize inventory can reduce inventory levels by 20% to 50%.

AI-based inspection systems can identify production defects and anomalies in real time, reducing the risk of product recalls and improving overall quality. This includes using machine vision to identify defects on the assembly line that may not be visible to the human eye. AI-powered quality control software can create rules to determine the features that define acceptable quality levels.

AI can lead to better-informed pricing decisions by analyzing market trends, competitor pricing and customer behavior. By factoring these into pricing strategies you can predict how different prices will impact sales, and even combine experience and data to increase prices without damaging sales.

AI helps optimize production schedules, reduce lead times and avoid stockouts by predicting product demand based on historical data and trends. In addition to predicting consumer demand for your SKUs, AI can use real-time data to create forecasts based on current supply chain conditions.

AI can predict the types and quantities of products that will be in demand with remarkable accuracy, enabling you to reduce production lead times, lower costs, and increase customer satisfaction. AI algorithms can predict the expected quantities for products in demand, thereby reducing strains on specific links of your supply chain.

AI helps reduce repair costs and extends the life of production machines by predicting equipment failure and scheduling preventative maintenance before breakdowns occur. In addition, AI can improve worker safety by minimizing human errors and accidents while increasing efficiency and productivity.

AI-powered ERP software can assist in reducing labor costs by predicting employee productivity, identifying training needs, and optimizing scheduling. It can also lower insurance rates and medical costs by reducing workplace injuries via streamlining or automating risky processes.

AI-powered ERP software provides key performance indicators on production rates, inventory levels, and quality metrics to facilitate data-driven decisions and identify areas for improvement. AI speeds up the analytics process by preparing, analyzing and assessing data as soon as it is available.

AI can help minimize labor shortages through robotic automation, additive manufacturing, and machine vision. AI applications enable robot arms to safely handle objects on the production line and can even train robots to perform various types of assembly line work done by humans.

AI's ability to automate production processes improves efficiency by reducing the need for human intervention. Machine automation can handle a wide range of production processes, from repetitive tasks such as data entry and order processing to complex tasks like spotting anomalies on the production line.

The manufacturing industry has begun to gravitate toward big data in large part due to its ability to produce predictive forecasts regarding sales, pricing, material availability, and other key metrics. Using advanced technology, including AI, big data pieces together very large and diverse datasets that are used in machine learning, predictive modeling, and other advanced analytics to solve business problems and make informed decisions.

The complexity of AI algorithms can be daunting. Yet, their ability to look to the past, present and future will help modernize manufacturing.


A Collection Of Free Data Science Courses From Harvard, Stanford, MIT, Cornell, and Berkeley – KDnuggets

Free courses are very popular on our platform, and we've received many requests from both beginners and professionals for more resources. To meet the demand of aspiring data scientists, we are providing a collection of free data science courses from the top universities in the world.

University professors and technical assistants teach these courses and cover topics such as math, probability, programming, databases, data analytics, data processing, data analysis, and machine learning. By the end of these courses, you'll have gained the skills required to master data science and become job-ready.

Link: 5 Free University Courses to Learn Computer Science

If you're considering switching to a career in data, it's crucial to learn computer science fundamentals. Many data science job applications include a coding interview section where you'll need to solve problems using a programming language of your choice.

This compilation offers some of the best free university courses to help you master foundations like computer hardware/software. You will learn Python, data structures and algorithms, as well as essential tools for software engineering.

Link: 5 Free University Courses to Learn Python

A curated list of five online courses offered by renowned universities like Harvard, MIT, Stanford, University of Michigan, and Carnegie Mellon University. These courses are designed to teach Python programming to beginners, covering fundamentals such as variables, control structures, data structures, file I/O, regular expressions, object-oriented programming, and computer science concepts like recursion, sorting algorithms, and computational limits.

Link: 5 Free University Courses to Learn Databases and SQL

It is a list of free database and SQL courses offered by renowned universities such as Cornell, Harvard, Stanford, and Carnegie Mellon University. These courses cover a wide range of topics, from the basics of SQL and relational databases to advanced concepts like NoSQL, NewSQL, database internals, data models, database design, distributed data processing, transaction processing, query optimization, and the inner workings of modern analytical data warehouses like Google BigQuery and Snowflake.

Link: 5 Free University Courses on Data Analytics

Compilation of online courses and resources available for individuals interested in pursuing data science, machine learning, and artificial intelligence. It highlights courses from prestigious institutions like Harvard, MIT, Stanford, Berkeley, covering topics such as Python for data science, statistical thinking, data analytics, mining massive data sets, and an introduction to artificial intelligence.

Link: 5 Free University Courses to Learn Data Science

A comprehensive list of free online courses from Harvard, MIT, and Stanford, designed to help individuals learn data science from the ground up. It begins with an introduction to Python programming and data science fundamentals, followed by courses covering computational thinking, statistical learning, and the mathematics behind data science concepts. The courses cover a wide range of topics, including programming, statistics, machine learning algorithms, dimensionality reduction techniques, clustering, and model evaluation.

Link: 9 Free Harvard Courses to Learn Data Science - KDnuggets

It outlines a data science learning roadmap consisting of 9 free courses offered by Harvard. It starts with learning programming basics in either R or Python, followed by courses on data visualization, probability, statistics, and productivity tools. It then covers data pre-processing techniques, linear regression, and machine learning concepts. The final step involves a capstone project that allows learners to apply the knowledge gained from the previous courses to a hands-on data science project.

Free online courses from top universities are an incredible resource for anyone looking to break into the field of data science or upgrade their current skills. This curated collection contains a list of courses that covers all the key areas - from core computer science and programming with Python, to databases and SQL, data analytics, machine learning, and full data science curricula. With courses taught by world-class professors, you can gain comprehensive knowledge and hands-on experience with the latest data science tools and techniques used in industry.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.


Create Mixtures of Experts with MergeKit | by Maxime Labonne | Mar, 2024 – Towards Data Science

MoEs also come with their own set of challenges, especially in terms of fine-tuning and memory requirements. Fine-tuning can be difficult due to the model's complexity and the need to balance expert usage during training, so that the gating weights learn to select the most relevant experts. In terms of memory, even though only a fraction of the total parameters is used during inference, the entire model, including all experts, needs to be loaded into memory, which requires high VRAM capacity.

More specifically, there are two essential parameters when it comes to MoEs:

Historically, MoEs have underperformed dense models. However, the release of Mixtral-8x7B in December 2023 shook things up and showed impressive performance for its size. Additionally, GPT-4 is rumored to be an MoE, which would make sense: it would be a lot cheaper for OpenAI to run and train than a dense model. In addition to these recent excellent MoEs, we now have a new way of creating MoEs with MergeKit: frankenMoEs, also called MoErges.

The main difference between true MoEs and frankenMoEs is how they're trained. In the case of true MoEs, the experts and the router are trained jointly. In the case of frankenMoEs, we upcycle existing models and initialize the router afterward.

In other words, we copy the weights of the layer norm and self-attention layers from a base model, and then copy the weights of the FFN layers found in each expert. This means that besides the FFNs, all the other parameters are shared. This explains why Mixtral-8x7B with eight experts doesn't have 8 * 7 = 56B parameters, but about 45B. This is also why using two experts per token gives the inference speed (FLOPs) of a 12B dense model instead of 14B.
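This parameter arithmetic can be sketched numerically. The figures below (layer count, hidden and FFN sizes of a Mistral-7B-like architecture) are rough assumptions for illustration, and the sketch ignores layer norms and the router, so the totals are approximate:

```python
# Back-of-the-envelope parameter count for an upcycled MoE.
# Rough Mistral-7B-like dimensions (illustrative; exact numbers vary).
layers, hidden, inter, vocab = 32, 4096, 14336, 32000

ffn = layers * 3 * hidden * inter                          # gate/up/down projections, per expert
attn = layers * (2 * hidden * hidden + 2 * hidden * 1024)  # GQA q/o plus smaller k/v
embed = 2 * vocab * hidden                                 # input embeddings + LM head
shared = attn + embed                                      # copied once, shared by all experts

def moe_params(n_experts):
    # Only the FFN blocks are duplicated per expert; everything else is shared.
    return shared + n_experts * ffn

print(f"{moe_params(1) / 1e9:.1f}B")  # ~7B dense baseline
print(f"{moe_params(8) / 1e9:.1f}B")  # ~46B, far below 8 * 7 = 56B
```

Because only the FFN term scales with the number of experts, the total grows sublinearly, which is exactly why eight 7B experts land near the ~45B the article cites rather than 56B.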

FrankenMoEs are about selecting the most relevant experts and initializing them properly. MergeKit currently implements three ways of initializing the routers:

As you can guess, the hidden initialization is the most efficient to correctly route the tokens to the most relevant experts. In the next section, we will create our own frankenMoE using this technique.

To create our frankenMoE, we need to select n experts. In this case, we will rely on Mistral-7B thanks to its popularity and relatively small size. However, eight experts, as in Mixtral, is quite a lot, as we need to fit all of them in memory. For efficiency, I'll only use four experts in this example, with two of them engaged for each token and each layer. In this case, we will end up with a model with 24.2B parameters instead of 4 * 7 = 28B.

Here, our goal is to create a well-rounded model that can do pretty much everything: write stories, explain articles, code in Python, etc. We can decompose this requirement into four tasks and select the best expert for each of them. This is how I decomposed it:

Now that we've identified the experts we want to use, we can create the YAML configuration that MergeKit will use to create our frankenMoE. This uses the mixtral branch of MergeKit. You can find more information about how to write the configuration on this page. Here is our version:

For each expert, I provide five basic positive prompts. You can be a bit fancier and write entire sentences if you want. The best strategy consists of using real prompts that should trigger a particular expert. You can also add negative prompts to do the opposite.
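A mergekit-moe configuration generally takes the following shape; the expert model names below are placeholders, not necessarily the experts chosen for Beyonder, and the prompts are illustrative:

```yaml
base_model: mistralai/Mistral-7B-v0.1
gate_mode: hidden        # hidden-state routing initialization
experts:
  - source_model: <chat-expert-model>   # placeholder: a general-purpose chat expert
    positive_prompts:
      - "chat"
      - "assistant"
      - "tell me"
      - "explain"
      - "I want"
  - source_model: <code-expert-model>   # placeholder: a code expert
    positive_prompts:
      - "code"
      - "python"
      - "programming"
      - "algorithm"
      - "develop"
```

Each expert gets its own `source_model` and list of `positive_prompts`; with `gate_mode: hidden`, MergeKit embeds these prompts to initialize the router toward the matching expert.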

Once this is ready, you can save your configuration as config.yaml. In the same folder, we will download and install the mergekit library (mixtral branch).

If your computer has enough RAM (roughly 24-32 GB), you can run the following command:

If you don't have enough RAM, you can shard the models instead as follows (it will take longer):

This command automatically downloads the experts and creates the frankenMoE in the merge directory. For the hidden gate mode, you can also use the --load-in-4bit and --load-in-8bit options to compute hidden states with lower precision.

Alternatively, you can copy your configuration into LazyMergekit, a wrapper I made to simplify model merging. In this Colab notebook, you can input your model name, select the mixtral branch, specify your Hugging Face username/token, and run the cells. After creating your frankenMoE, it will also upload it to the Hugging Face Hub with a nicely formatted model card.

I called my model Beyonder-4x7B-v3 and created GGUF versions of it using AutoGGUF. If you cant run GGUF versions on your local machine, you can also perform inference using this Colab notebook.

To get a good overview of its capabilities, it has been evaluated on three different benchmarks: Nous benchmark suite, EQ-Bench, and the Open LLM Leaderboard. This model is not designed to excel in traditional benchmarks, as the code and role-playing models generally do not apply to those contexts. Nonetheless, it performs remarkably well thanks to strong general-purpose experts.

Nous: Beyonder-4x7B-v3 is one of the best models on Nous benchmark suite (evaluation performed using LLM AutoEval) and significantly outperforms the v2. See the entire leaderboard here.

EQ-Bench: It's also the best 4x7B model on the EQ-Bench leaderboard, outperforming older versions of ChatGPT and Llama-2-70b-chat. Beyonder is very close to Mixtral-8x7B-Instruct-v0.1 and Gemini Pro, which are (supposedly) much bigger models.

Open LLM Leaderboard: Finally, it's also a strong performer on the Open LLM Leaderboard, significantly outperforming the v2 model.

On top of these quantitative evaluations, I recommend checking the model's outputs in a more qualitative way, using a GGUF version in LM Studio. A common way of testing these models is to gather a private set of questions and check their outputs. With this strategy, I found that Beyonder-4x7B-v3 is quite robust to changes in the user and system prompts compared with other models, including AlphaMonarch-7B. This is pretty cool, as it improves the usefulness of the model in general.

FrankenMoEs are a promising but still experimental approach. The trade-offs, like higher VRAM demand and slower inference speeds, can make it challenging to see their advantage over simpler merging techniques like SLERP or DARE TIES. In particular, when you use frankenMoEs with just two experts, they might not perform as well as simply merging the two models. However, frankenMoEs excel at preserving knowledge, which can result in stronger models, as demonstrated by Beyonder-4x7B-v3. With the right hardware, these drawbacks can be effectively mitigated.

In this article, we introduced the Mixture of Experts architecture. Unlike traditional MoEs that are trained from scratch, MergeKit facilitates the creation of MoEs by ensembling experts, offering an innovative approach to improving model performance and efficiency. We detailed the process of creating a frankenMoE with MergeKit, highlighting the practical steps involved in selecting and combining different experts to produce a high-quality MoE.

Thanks for reading this article. I encourage you to try making your own frankenMoEs using LazyMergekit: select a few models, create your config based on Beyonder's, and run the notebook to create your own models! If you liked this article, please follow me on Hugging Face and on X/Twitter @maximelabonne.


Text Embeddings, Classification, and Semantic Search | by Shaw Talebi – Towards Data Science

Imports

We start by importing dependencies and the synthetic dataset.

Next, we'll generate the text embeddings. Instead of using the OpenAI API, we will use an open-source model from the Sentence Transformers Python library. This model was specifically fine-tuned for semantic search.

To see the different resumes in the dataset and their relative locations in concept space, we can use PCA to reduce the dimensionality of the embedding vectors and visualize the data on a 2D plot (code is on GitHub).

From this view we see the resumes for a given role tend to clump together.

Now, to do a semantic search over these resumes, we can take a user query, translate it into a text embedding, and then return the nearest resumes in the embedding space. Here's what that looks like in code.
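Assuming the embedding model has already encoded the resumes and the query into vectors, the retrieval step reduces to cosine-similarity nearest neighbors. A minimal NumPy sketch, with tiny illustrative vectors standing in for real embeddings (the original notebook's variable names may differ):

```python
import numpy as np

# Pretend embeddings: rows would come from the embedding model's encode().
resume_embeddings = np.array([
    [0.9, 0.1, 0.0],   # resume 0: data-engineering heavy
    [0.1, 0.9, 0.0],   # resume 1: front-end heavy
    [0.8, 0.2, 0.1],   # resume 2: data-engineering heavy
])
query_embedding = np.array([1.0, 0.0, 0.0])  # encoded user query

def top_k(query, corpus, k=2):
    # Cosine similarity between the query and every document vector.
    sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]   # indices of the k nearest resumes

print(top_k(query_embedding, resume_embeddings))  # [0 2]: the two DE resumes
```

In the real pipeline the corpus matrix holds one embedding per resume and k would be 10, but the ranking logic is the same.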

Printing the roles of the top 10 results, we see almost all are data engineers, which is a good sign.

Let's look at the resume of the top search result.

Although this is a made-up resume, the candidate likely has all the necessary skills and experience to fulfill the user's needs.

Another way to look at the search results is via the 2D plot from before. Here's what that looks like for a few queries (see plot titles).

While this simple search example does a good job of matching particular candidates to a given query, it is not perfect. One shortcoming is when the user query includes a specific skill. For example, in the query "Data Engineer with Apache Airflow experience," only one of the top five results has Airflow experience.

This highlights that semantic search is not better than keyword-based search in all situations. Each has its strengths and weaknesses.

Thus, a robust search system will employ so-called hybrid search, which combines the best of both techniques. While there are many ways to design such a system, a simple approach is applying keyword-based search to filter down results, followed by semantic search.
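
A sketch of that filter-then-rank pipeline: a keyword pre-filter guarantees the required skill appears, then a semantic score orders the survivors. The semantic_score values here are hypothetical stand-ins for cosine similarities from an embedding model.

```python
# Hypothetical resume records; semantic_score stands in for an embedding-based
# similarity to the user's query.
resumes = [
    {"id": 1, "text": "data engineer, apache airflow, spark", "semantic_score": 0.91},
    {"id": 2, "text": "data engineer, sql and dbt pipelines",  "semantic_score": 0.88},
    {"id": 3, "text": "ml engineer, airflow orchestration",    "semantic_score": 0.75},
    {"id": 4, "text": "frontend developer, react",             "semantic_score": 0.10},
]

def hybrid_search(keyword, top_k=2):
    # Stage 1: keyword-based filter guarantees the required skill is present.
    filtered = [r for r in resumes if keyword.lower() in r["text"]]
    # Stage 2: semantic ranking orders the survivors by relevance.
    filtered.sort(key=lambda r: r["semantic_score"], reverse=True)
    return [r["id"] for r in filtered[:top_k]]

print(hybrid_search("airflow"))  # only resumes that literally mention airflow
```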

Two additional strategies for improving search are using a Reranker and fine-tuning text embeddings.

A Reranker is a model that directly compares two pieces of text. In other words, instead of computing the similarity between pieces of text via a distance metric in the embedding space, a Reranker computes such a similarity score directly.

Rerankers are commonly used to refine search results. For example, one can return the top 25 results using semantic search and then refine to the top 5 with a Reranker.
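
Such a retrieve-then-rerank pipeline can be outlined as below. Both scoring functions are hypothetical placeholders: in practice the first stage would be embedding similarity and the second a cross-encoder model scoring the (query, document) pair jointly.

```python
def embedding_score(query, doc):
    # Cheap first-stage score: fraction of query words present in the doc,
    # a stand-in for cosine similarity in embedding space.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def rerank_score(query, doc):
    # Placeholder for a cross-encoder: here, exact phrase hits score higher.
    return 2.0 if query.lower() in doc.lower() else embedding_score(query, doc)

def search_with_reranker(query, docs, retrieve_k=25, final_k=5):
    # Stage 1: fast retrieval of a candidate pool with the cheap score.
    candidates = sorted(docs, key=lambda d: embedding_score(query, d),
                        reverse=True)[:retrieve_k]
    # Stage 2: slower, more accurate reranking of the short list.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:final_k]

docs = ["data engineer resume", "resume of a data engineer",
        "engineer of data platforms", "graphic designer resume"]
print(search_with_reranker("data engineer", docs, retrieve_k=3, final_k=2))
```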

Fine-tuning text embeddings involves adapting an embedding model for a particular domain. This is a powerful approach because most embedding models are based on a broad collection of text and knowledge. Thus, they may not optimally organize concepts for a specific industry, e.g. data science and AI.

Although everyone seems focused on the potential for AI agents and assistants, recent innovations in text-embedding models have unlocked countless opportunities for simple yet high-value ML use cases.

Here, we reviewed two widely applicable use cases: text classification and semantic search. Text embeddings enable simpler and cheaper alternatives to LLM-based methods while still capturing much of the value.

More on LLMs

See the original post here:

Text Embeddings, Classification, and Semantic Search | by Shaw Talebi - Towards Data Science

Read More..

Transform Accelerator Announces Data Science and AI Startups Selected for Cohort 3 – Polsky Center for … – Polsky Center for Entrepreneurship and…

Published on Thursday, March 21, 2024

Following the success of its first and second cohorts, Transform adds seven new early-stage companies utilizing advances in data science and AI.

The University of Chicago's Polsky Center for Entrepreneurship and Innovation and Data Science Institute today announced the seven early-stage companies accepted into the third cohort of the Transform accelerator for data science and AI startups.

Powered by the Polsky Center's Deep Tech Ventures, Transform provides full-spectrum support for the startups accepted into the accelerator, including access to business and technical training, industry mentorship, venture capital connections, and funding opportunities.

The seven startups will receive approximately $250,000 in total investment, including $25,000 in funding, credits for Google for Startups, workspace in the Polsky Exchange on Chicagos South Side, and access to industry mentors, technical advisors and student talent from the University of Chicago Department of Computer Science, Data Science Institute (DSI), and the Chicago Booth School of Business.

Transform Cohort 3:

"I am excited to welcome cohort three into Transform. This cycle was particularly competitive, and we are delighted with the seven companies we selected," said Shyama Majumdar, director of Transform. "We have a good mix of healthcare, construction, manufacturing, and fintech companies represented as we continue to see generative AI startups leading the way, which is reflected in this cohort. After the success of cohort 2, we are ready to run with cohort 3 and help pave their way to success."

The accelerator launched in Spring 2023 with its inaugural cohort, and those startups are already seeing success. Echo Labs, a transcription platform in the previous cohort, has scaled up, hiring software engineers to meet the demand of partnerships with 150 universities piloting its product. Blackcurrant, an online business-to-business marketplace for buying and selling hydrogen and a member of the first cohort, was recently awarded a $250,000 co-investment from the George Shultz Innovation Fund after participating in the program.

"The continued success of Transform startups has been very encouraging," said David Uminsky, executive director of the Data Science Institute. "The wide range of sectors this new cohort serves demonstrates AI's increasing impact on business."

Transform is partly supported by corporate partner McAndrews, Held & Malloy, Ltd. and venture partner True Blue Partners, a Silicon Valley-based venture capital firm investing in early-stage AI companies, founded by Chicago Booth alum Sunil Grover, MBA '99.

"Transform is providing the fertile ground necessary to help incubate the next generation of market leaders," said Grover, who also is a former engineer with nearly two decades of experience helping build companies as an entrepreneur, investor, and advisor. "Advancements in deep tech present a unique interdisciplinary opportunity to re-imagine every aspect of the business world. This, I believe, will lead to the creation of innovative new businesses that are re-imagined from the ground up to apply the capabilities these new technologies enable."

Original post:

Transform Accelerator Announces Data Science and AI Startups Selected for Cohort 3 - Polsky Center for ... - Polsky Center for Entrepreneurship and...

Read More..

Digital Transformation in Finance: Challenges and Benefits – Data Science Central

Digital transformation is no longer a choice but a necessity for financial institutions looking to stay competitive in the modern business world. From improving customer experience to increasing operational efficiency and enhancing security, the benefits in finance are numerous. Still, with benefits come challenges and risks that must be addressed to ensure successful implementation. In this article, we will discuss the advantages and challenges of digital transformation in the finance sector, as well as examples of companies that have successfully used it to their advantage.

Digital transformation in finance is the process of implementing advanced digital technologies to improve financial processes, services, and customer experiences. It involves integrating technologies such as big data analytics, cloud computing, artificial intelligence, blockchain, and robotic process automation to automate and streamline financial operations. This process aims to enhance efficiency, reduce costs, mitigate risks, and provide more personalized services to clients. By leveraging digital technologies, financial institutions can gain a competitive advantage in the market and stay ahead of rapidly evolving customer needs and preferences.

The finance industry has traditionally been slow to adopt new technologies, but their arrival has made it imperative for financial institutions to embrace transformation. Digital transformation enables financial institutions to offer personalized services, reduce costs, increase efficiency, mitigate risks, and improve customer experiences. By embracing it, financial institutions can leverage data and analytics to make more informed decisions and enhance their operations. Digital transformation in finance can also help financial institutions stay ahead of the competition by enabling them to create new products and services that cater to the evolving needs of their clients. Thus, digital transformation is pivotal for financial institutions to stay relevant and thrive in today's competitive landscape.

Digital transformation is reshaping the financial industry, providing numerous benefits to both financial institutions and their clients. In this section, we will explore some of its key benefits in finance, including enhanced customer experience, increased efficiency, improved data analysis, enhanced security, and competitive advantage.

Digital transformation enhances customer experience: financial institutions can provide personalized services and improve accessibility through different digital channels. This can drive increased customer satisfaction and loyalty.

Digital transformation can help financial institutions automate and streamline different processes, leading to cost savings, faster turnaround times, and improved accuracy.

It enables financial institutions to leverage advanced analytics tools and algorithms to make more informed decisions and identify new business opportunities.

Digital transformation can improve security by implementing advanced cybersecurity measures such as encryption, biometric authentication, and real-time monitoring. This can protect financial institutions from cyber threats and ensure the safety of client data.

It can also give financial institutions a competitive advantage by enabling them to create new products and services that cater to the evolving needs of their clients. Financial institutions that adopt digital transformation are able to stay ahead of the competition and remain relevant in today's digital era.

Digital transformation in finance is revolutionizing the financial sector, with a broad range of impacts affecting businesses and customers alike. From the disruption of traditional business models to increased competition and greater personalization, the benefits and challenges of this transformation are far-reaching. In this section, we'll explore the major ways in which digital transformation is impacting the financial industry.

It is disrupting traditional business models in the financial industry by creating new ways of delivering financial services, such as peer-to-peer lending, robo-advisory services, and mobile payments. As a result, traditional financial institutions face fierce competition from digital-only startups and fintech companies that are more adaptable and agile.

Digital transformation has significantly increased competition in the financial industry, as clients now have access to a wider range of financial services and providers. This has forced traditional financial institutions to improve their services, reduce costs, and innovate to remain competitive.

It has enabled financial institutions to automate and streamline different processes, resulting in faster turnaround times, reduced costs, and enhanced accuracy. For example, digital processes can help financial institutions handle client onboarding and loan processing more efficiently.

It has also enabled personalized services based on client behavior and preferences, leading to increased customer satisfaction and loyalty. Using data analytics, financial institutions can offer personalized investment advice and customized product recommendations.

Digital transformation in finance has made financial services more accessible and convenient for clients, who can now access their accounts and conduct transactions through multiple digital channels such as mobile apps, online portals, and chatbots.

It has also brought new security risks to the financial industry, as financial transactions and client data are increasingly exposed to cyber threats. Financial institutions must implement robust security measures to protect themselves and their clients from potential cyberattacks.

Digital transformation in finance is not without its challenges and risks. In this section, we will explore some of the common obstacles that financial institutions face when undergoing this process.

One of the common challenges in digital transformation is resistance to change from employees and clients. It is not easy to introduce new technologies and processes, and some individuals may feel uncomfortable or threatened by the changes. Proper communication and training are necessary to ensure a smooth transition.

The adoption of new technologies may require the replacement or integration of legacy systems and processes. These systems can be outdated and incompatible with modern tools, which can create obstacles and delays in digital transformation. Upgrading legacy systems and processes can be expensive and time-consuming, but it is necessary to ensure a smooth transition.

Digital transformation generates an enormous amount of data, and managing that data can be a significant challenge for financial institutions. Data management includes collecting, processing, storing, and analyzing data, which can be time-consuming and require significant resources. Effective data management is essential to realize the full benefits of digital transformation.

This process introduces new cybersecurity risks, including data breaches, phishing attacks, and ransomware. Financial institutions must take adequate measures to protect themselves and their clients from these threats. This includes implementing strong cybersecurity policies, training employees on best practices, and investing in cybersecurity technologies.

Digital transformation has become a necessity for financial institutions to remain competitive in today's market. While there are challenges and risks associated with digital transformation, the benefits are numerous, including enhanced customer experience, increased efficiency, and improved data analysis. Successful examples such as JPMorgan Chase, Ally Financial, Capital One, Goldman Sachs, and Mastercard show how digital transformation can lead to improved business outcomes.

With the right strategy and implementation approach, financial institutions can navigate the challenges and reap the rewards of digital transformation. At Aeologic Technologies, we strive to provide innovative solutions that enable financial institutions to achieve their digital transformation objectives and stay ahead of the curve.

Read this article:

Digital Transformation in Finance: Challenges and Benefits - Data Science Central

Read More..

Cracking the Apache Spark Interview: 80+ Top Questions and Answers for 2024 – Simplilearn

Apache Spark is a unified analytics engine for processing large volumes of data. It can run workloads 100 times faster and offers over 80 high-level operators that make it easy to build parallel apps. Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access data from multiple sources.

This article covers the most important Apache Spark interview questions that you might face in a Spark interview. The questions have been segregated into different sections based on the various components of Apache Spark, and after going through this article you should be able to answer most of the questions asked in your next Spark interview.


Apache Spark vs. MapReduce

Spark processes data in batches as well as in real time; MapReduce processes data in batches only.

Spark runs almost 100 times faster than Hadoop MapReduce; MapReduce is slower when it comes to large-scale data processing.

Spark stores data in RAM, i.e. in-memory, so it is easier to retrieve; Hadoop MapReduce data is stored in HDFS and hence takes a long time to retrieve.

Spark provides caching and in-memory data storage; Hadoop is highly disk-dependent.

Apache Spark has 3 main categories that comprise its ecosystem. Those are:

This is one of the most frequently asked spark interview questions, and the interviewer will expect you to give a thorough answer to it.

Spark applications run as independent processes that are coordinated by the SparkSession object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply operations repeatedly to the data so they can benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or can be saved to the disk.

Resilient Distributed Datasets are the fundamental data structure of Apache Spark. It is embedded in Spark Core. RDDs are immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel. RDDs are split into partitions and can be executed on different nodes of a cluster.

RDDs are created by either transformation of existing RDDs or by loading an external dataset from stable storage like HDFS or HBase.

Here is what the architecture of an RDD looks like:

So far, if you have any doubts regarding the apache spark interview questions and answers, please comment below.

When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action. This is known as lazy evaluation, and it helps optimize the overall data processing workflow.
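
Outside Spark, the same idea can be illustrated with Python generators; this is a loose analogy rather than Spark code. The "transformation" builds a recipe and computes nothing, and work happens only when an "action" consumes it:

```python
log = []

def tracked_map(func, data):
    # Like an RDD transformation: returns a lazy recipe, computes nothing yet.
    for item in data:
        log.append(item)      # side effect lets us observe when work happens
        yield func(item)

recipe = tracked_map(lambda x: x * 10, [1, 2, 3])
assert log == []              # nothing computed: the "transformation" was lazy

result = list(recipe)         # list() plays the role of an action
assert result == [10, 20, 30]
assert log == [1, 2, 3]       # only now did the work actually run
```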

Also Read: What Are the Skills Needed to Learn Hadoop?

Apache Spark stores data in-memory for faster processing and building machine learning models. Machine Learning algorithms require multiple iterations and different conceptual steps to create an optimal model. Graph algorithms traverse through all the nodes and edges to generate a graph. These low latency workloads that need multiple iterations can lead to increased performance.

To trigger the clean-ups, you need to set the parameter spark.cleaner.ttl.

There are a total of 4 steps that can help you connect Spark to Apache Mesos.

Parquet is a columnar format that is supported by several data processing systems. With the Parquet file, Spark can perform both read and write operations.

Some of the advantages of having a Parquet file are:

Shuffling is the process of redistributing data across partitions that may lead to data movement across the executors. The shuffle operation is implemented differently in Spark compared to Hadoop.

Shuffling has 2 important compression parameters:

spark.shuffle.compress - checks whether the engine would compress shuffle outputs or not

spark.shuffle.spill.compress - decides whether to compress intermediate shuffle spill files or not
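
In a spark-defaults.conf file (or via --conf on spark-submit), these would be set as follows; the values shown are illustrative, and both parameters default to true in recent Spark versions:

```
# Compress map output files written during the shuffle
spark.shuffle.compress         true
# Compress intermediate data spilled to disk during the shuffle
spark.shuffle.spill.compress   true
```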

It occurs while joining two tables or while performing byKey operations such as GroupByKey or ReduceByKey.

Spark uses a coalesce method to reduce the number of partitions in a DataFrame.

Suppose you want to read data from a CSV file into an RDD having four partitions.

This is how a filter operation is performed to remove all the multiple of 10 from the data.

The RDD has some empty partitions. It makes sense to reduce the number of partitions, which can be achieved by using coalesce.

This is what the resultant RDD would look like after applying coalesce.
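
A toy model of what coalesce does, written in plain Python rather than the Spark API: it merges consecutive partitions down to the requested count without a full shuffle, so whole partitions move but individual records are not redistributed.

```python
def coalesce(partitions, num_partitions):
    """Merge consecutive partitions down to num_partitions, mimicking how
    Spark's coalesce combines local partitions instead of reshuffling records."""
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        target = i * num_partitions // len(partitions)  # group consecutive partitions
        merged[target].extend(part)
    return merged

# Four partitions, two of them left empty after a filter:
partitions = [[10, 20], [], [30], []]
print(coalesce(partitions, 2))  # the empty partitions are folded away
```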

Consider the following cluster information:

Here is the number of core identification:

To calculate the number of executor identification:

Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:

There are 2 ways to convert a Spark RDD into a DataFrame:

import com.mapr.db.spark.sql._

val df = sc.loadFromMapRDB()
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()

You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object:

def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame

Resilient Distributed Dataset (RDD) is a rudimentary data structure of Spark. RDDs are the immutable Distributed collections of objects of any type. It records the data from various nodes and prevents it from significant faults.

The Resilient Distributed Dataset (RDD) in Spark supports two types of operations. These are:

The transformation function generates a new RDD from pre-existing RDDs in Spark. Whenever a transformation occurs, it generates a new RDD by taking an existing RDD as input and producing one or more RDDs as output. Due to their immutable nature, the input RDDs don't change and remain constant.

Along with this, applying a Spark transformation builds an RDD lineage that includes all parent RDDs of the final RDDs. This RDD lineage is also called the RDD operator graph or RDD dependency graph. An RDD transformation is the logically executed plan, which means it is a Directed Acyclic Graph (DAG) of the continuous parent RDDs of the RDD.

An RDD action works on an actual dataset by performing some specific computation. Whenever an action is triggered, no new RDD is generated, as happens with a transformation. That is, actions are Spark RDD operations that produce non-RDD values; the driver and external storage systems store these non-RDD values. An action brings all the RDDs into motion.

Properly defined, an action is how data is sent from the executors to the driver. Executors are agents responsible for executing tasks, while the driver works as a JVM process that coordinates the workers and task execution.

This is another frequently asked Spark interview question. A lineage graph is a graph of the dependencies between an existing RDD and a new RDD. It means that all the dependencies between RDDs are recorded in a graph, rather than the original data being stored.

The RDD lineage graph is needed when we want to compute a new RDD or recover lost data from a persisted RDD that was lost. Spark does not support data replication in memory, so if any data is lost, it can be rebuilt using the RDD lineage. It is also called an RDD operator graph or RDD dependency graph.
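
A toy illustration of the idea, in plain Python rather than Spark internals: because each RDD records its parent and the transformation that produced it, "lost" data can be recomputed from the source by replaying the lineage instead of reading a replica.

```python
class ToyRDD:
    """Records its parent and transformation so data can be rebuilt on loss."""
    def __init__(self, source_data=None, parent=None, func=None):
        self.source_data = source_data
        self.parent, self.func = parent, func

    def map(self, func):
        # Lineage: the new RDD remembers its parent and the operation.
        return ToyRDD(parent=self, func=func)

    def compute(self):
        if self.parent is None:
            return list(self.source_data)        # base RDD: read from storage
        # Replay the recorded lineage from the parent downward.
        return [self.func(x) for x in self.parent.compute()]

base = ToyRDD(source_data=[1, 2, 3])
derived = base.map(lambda x: x + 1).map(lambda x: x * 10)

result = derived.compute()     # normal computation
recovered = derived.compute()  # "lost" data is rebuilt by replaying the graph
print(result, recovered)
```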

A Discretized Stream (DStream) is a continuous sequence of RDDs and the fundamental abstraction in Spark Streaming. These RDD sequences are all of the same type, representing a continuous stream of data. Every RDD contains data from a specific interval.

The DStreams in Spark take input from many sources such as Kafka, Flume, Kinesis, or TCP sockets. It can also work as a data stream generated by converting the input stream. It facilitates developers with a high-level API and fault tolerance.

Caching, also known as persistence, is an optimization technique for Spark computations. Similar to RDDs, DStreams allow developers to persist the stream's data in memory: calling the persist() method on a DStream automatically persists every RDD of that DStream in memory. This helps save interim partial results so they can be reused in subsequent stages.

For input streams that receive data over the network, the default persistence level replicates the data to two nodes for fault tolerance.

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication costs.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value

res0: Array[Int] = Array(1, 2, 3)

So far, if you have any doubts regarding the spark interview questions for beginners, please ask in the comment section below.

Moving forward, let us look at the Spark interview questions for experienced candidates.

DataFrame can be created programmatically with three steps:

1. map(func)

2. transform(func)

3. filter(func)

4. count()

The correct answer is c) filter(func).

This is one of the most frequently asked spark interview questions where the interviewer expects a detailed answer (and not just a yes or no!). Give as detailed an answer as possible here.

Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save the data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and start from wherever it stopped.

There are 2 types of data for which we can use checkpointing in Spark.

Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.

Data Checkpointing: Here, we save the RDD to reliable storage because its need arises in some of the stateful transformations. In this case, the upcoming RDD depends on the RDDs of previous batches.

Controlling the transmission of data packets between multiple computer networks is done by the sliding window. Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data.

DISK_ONLY - Stores the RDD partitions only on the disk

MEMORY_ONLY_SER - Stores the RDD as serialized Java objects with a one-byte array per partition

MEMORY_ONLY - Stores the RDD as deserialized Java objects in the JVM. If the RDD is not able to fit in the memory available, some partitions won't be cached

OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory

MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the RDD is not able to fit in the memory, additional partitions are stored on the disk

MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk

See the rest here:

Cracking the Apache Spark Interview: 80+ Top Questions and Answers for 2024 - Simplilearn

Read More..