
Dominici named to Time 100 Health list – Harvard Gazette

Time magazine has named Francesca Dominici, Clarence James Gamble Professor of Biostatistics, Population and Data Science at the Harvard T.H. Chan School of Public Health and the faculty director of the Harvard Data Science Initiative, to the inaugural 2024 Time 100 Health list in recognition of her groundbreaking contributions to global health.

The Time 100 list, first established in 1999, recognizes the activism, innovation, and achievement of the most influential individuals in the world each year. Time 100 Health is a new annual list of the 100 individuals who have most influenced global health.

"I am honored to be recognized on this new list," said Dominici. "I cannot thank enough all of the students, postdocs, and collaborators who have worked with me tirelessly, and with enormous enthusiasm, to minimize the climate crisis's adverse impacts on health and to use data science to find climate adaptation solutions."

Dominici's research focuses on building AI-ready data platforms and developing causally aware AI approaches to identify ways to prevent death and disease from exposure to environmental contaminants and climate-related stressors. She leads several interdisciplinary groups of scientists exploring questions in environmental health science, climate change, cancer research, and health policy. As faculty director of the Harvard Data Science Initiative (HDSI), she leads several University-wide research programs and activities to advance data science and leverage it to analyze the current breakdown of trust in scientific research. HDSI's programs include a large postdoctoral fellows program that draws outstanding early-career scholars from diverse disciplines as well as the AWS Impact Computing Project, a collaboration with Amazon Web Services that advances innovative solutions to global challenges such as climate change, health inequity, and food insecurity.

Dominici has been named to the Time 100 Health list following her own recent research on using data science to inform U.S. air quality policy, which directly impacted the Feb. 7, 2024, revisions of the National Ambient Air Quality Standards. The revised standards will result in significant public health net benefits that could be worth as much as $46 billion by 2032. In addition to this work, Dominici is currently leading a large group of data scientists in building causally aware AI models trained on claims data from across the entire U.S. healthcare system. Their goal is to identify solutions for climate adaptation and ways of reducing the energy sector's carbon footprint and adverse health impacts in the U.S. and around the world.

Follow this link:

Dominici named to Time 100 Health list - Harvard Gazette


Get Underlined Text from Any PDF with Python – Towards Data Science

A step-by-step guide to get underlined text as an array from PDF files.

If you want to see the code for this project, check out my repository: https://github.com/sasha-korovkina/pdfUnderlinedExtractor

PDF data extraction can be a real headache, and it gets even trickier when you're trying to snag underlined text. Believe it or not, there aren't any go-to solutions or libraries that handle this out of the box. But don't worry, I'm here to show you how to tackle this.

Extracting underlined text from PDFs can take a few different paths. You might consider using OCR to detect text components with bottom lines or delve into PyMuPDF's markup capabilities. However, I've found that OCR tends to falter, suffering from inconsistency and low accuracy. PyMuPDF isn't my favorite either: it demands finicky parameter tuning, which is time-consuming. Plus, one wrong setting and you could lose a bunch of data.

It is important to remember that PDFs are:

But fear not, as we have a strategy to resolve this.

We will use the pdfquery library, the most comprehensive PDF-to-XML converter that I have come across.
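
As a rough sketch of this first step (the file names are illustrative), loading the PDF with pdfquery and dumping its XML representation could look like this:

    import pdfquery

    # Load the PDF and write out its XML representation so we can study the structure.
    pdf = pdfquery.PDFQuery("document.pdf")
    pdf.load()
    pdf.tree.write("document.xml", pretty_print=True)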

2. Studying the XML

The XML has a few key components which we are interested in:

LTRect component example:
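
An illustrative example of what such an element can look like in the generated XML (the attribute values here are made up; the bounding-box attributes are the ones we care about):

    <LTRect x0="72.0" y0="690.5" x1="214.3" y1="691.2"
            width="142.3" height="0.7" linewidth="1"
            bbox="[72.0, 690.5, 214.3, 691.2]"/>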

Therefore, by converting the whole document into XML format, we can replicate its structure as XML components. Let's do just that!

Now, we will re-create the structure of our document as bounding box coordinates. To do this, we will parse the XML to define the page, component boxes, lines and rectangles, and then draw them all on our canvas in 3 different colors.
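
A sketch of this visualization step, assuming the document.xml file produced earlier (the color mapping and page size are illustrative):

    from lxml import etree
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches

    root = etree.parse("document.xml").getroot()
    fig, ax = plt.subplots(figsize=(8.5, 11))

    # One color per component type: black for text boxes, blue for LTRect
    # (the underlines), red for LTLine elements.
    colors = {"LTTextBoxHorizontal": "black", "LTRect": "blue", "LTLine": "red"}

    for tag, color in colors.items():
        for element in root.iter(tag):
            bbox = element.get("bbox")
            if not bbox:
                continue
            x0, y0, x1, y1 = [float(v) for v in bbox.strip("[]").split(",")]
            ax.add_patch(patches.Rectangle((x0, y0), x1 - x0, y1 - y0,
                                           fill=False, edgecolor=color))

    ax.set_xlim(0, 612)   # US-letter width in PDF points; adjust to your page size
    ax.set_ylim(0, 792)   # US-letter height in PDF points
    plt.show()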

Here is our initial PDF. It has been generated in Microsoft Word by exporting a document with some underlines to the PDF file format:

After applying the algorithm above, here is the visual representation we get:

This image represents the structure of our document, where the black box is used to describe all components on the page, and the blue is used to describe the LTRect elements, hence the underlined text.

Now, let's visualize all of the text within the PDF in its respective positions, with the following code:
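
Continuing the sketch above (reusing root and ax from the previous step), the idea is roughly:

    # Draw every text line at its PDF coordinates on the same canvas.
    for text_el in root.iter("LTTextLineHorizontal"):
        x0, y0, _, _ = [float(v) for v in text_el.get("bbox").strip("[]").split(",")]
        ax.text(x0, y0, (text_el.text or "").strip(), fontsize=6)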

Here is the output:

Note that the text is not exactly where it was in the original document, due to the difference in size and font of the mark-up language in the pdfquery library.

As a result of parsing our XML, we will have an array of coordinates of underlined regions; in my case, I have called it underline_text.

Heres the process:

This method of extracting text from PDFs using coordinate rectangles and Tesseract OCR is effective for several reasons:

And this is the code:
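
The sketch below shows the general idea: render each page to an image, crop a small region around each underline coordinate, and OCR it with Tesseract. It assumes underline_text holds (page_number, x0, y0, x1, y1) tuples in PDF points; the padding values are illustrative and may need tuning.

    import pytesseract
    from pdf2image import convert_from_path

    def extract_underlined_words(pdf_path, underline_text, dpi=200):
        pages = convert_from_path(pdf_path, dpi=dpi)
        scale = dpi / 72.0            # PDF points -> image pixels
        words = []
        for page_number, x0, y0, x1, y1 in underline_text:
            image = pages[page_number]
            height = image.height
            # PDF origin is bottom-left, image origin is top-left, so flip the
            # y axis and pad upward to capture the text sitting above the underline.
            box = (int(x0 * scale),
                   int(height - y1 * scale) - 25,
                   int(x1 * scale),
                   int(height - y0 * scale) + 5)
            crop = image.crop(box)
            words.append(pytesseract.image_to_string(crop).strip())
        return words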

Make sure that you have tesseract installed on your system before running this function. For in-depth instructions, check out their official installation guide here: https://github.com/tesseract-ocr/tessdoc/blob/main/Installation.md or in my GitHub repository here: https://github.com/sasha-korovkina/pdfUnderlinedExtractor.

Now, if we take any PDF file, like this example file:

We have some underlined words in this file:

After running the code described above, here is what we get:

After getting this array, you can use these words for further processing!

Enjoy using this script! I'd love to hear about any creative applications you come up with or if you'd like to contribute. Let me know!

Read the original here:

Get Underlined Text from Any PDF with Python - Towards Data Science


Analytics and Data Science News for the Week of May 3; Updates from Databricks, DataRobot, MicroStrategy & More – Solutions Review

Solutions Review Executive Editor Tim King curated this list of notable analytics and data science news for the week of May 3, 2024.

Keeping tabs on all the most relevant analytics and data science news can be a time-consuming task. As a result, our editorial team aims to provide a summary of the top headlines from the last week, in this space. Solutions Review editors will curate vendor product news, mergers and acquisitions, venture capital funding, talent acquisition, and other noteworthy analytics and data science news items.

FlightStream can capture subsonic to supersonic flows, including compressible effects and a unique surface vorticity capability. It leverages the strengths of panel method flow solvers and enhances them with modern computational techniques to provide a fast solver capable of handling complex aerodynamic phenomena.

Read on for more

By supporting any JDK distribution including those from Azul, Oracle, Amazon, Eclipse, Microsoft, Red Hat and others, Azul Intelligence Cloud delivers key benefits across an enterprise's entire Java fleet.

Read on for more

This authorization builds on Azure Databricks' FedRAMP High and IL5 authorizations and further demonstrates Databricks' commitment to meeting the federal government's requirements for handling highly sensitive and mission-critical defense Controlled Unclassified Information (CUI) across a wide variety of data analytics and AI use cases.

Read on for more

In the last 12 months, DataRobot introduced industry-first generative AI functionality, launched the DataRobot Generative AI Catalyst Program to jumpstart high-priority use cases, and announced expanded collaborations with NVIDIA and Google Cloud to supercharge AI solutions with world-class performance and security.

Read on for more

The platform enables data scientists and CISO teams to gain valuable understanding and insights into AI systems' risks and challenges, alongside comprehensive protection and alerts. DeepKeep is already deployed by leading global enterprises in the finance, security, and AI computing sectors.

Read on for more

MicroStrategy AI, introduced in 2023, is now in its third GA release, which includes enhanced AI explainability, automated workflows, and several other features designed to increase convenience, reliability, and flexibility for customers. The company also launched MicroStrategy Auto Express, allowing anyone to create and share AI-powered bots and dashboards free for 30 days.

Read on for more

The Predactiv platform emerged from the success of ShareThis, a leading data and programmatic advertising provider, recognized for engineering expertise and pioneering the use of AI in audience and insight creation.

Read on for more

Salesforce Inc. rolled out data visualization and AI infrastructure improvements for its Tableau Software platform today that increase the usability of its product for data analysts and expand its scalability for artificial intelligence.

Read on for more

Sigma is building the AI Toolkit for Business, which will bring powerful AI and ML innovations from the data ecosystem into an intuitive interface that anyone can use. With forms and actions, a customer's data app is constantly up to date because it's directly connected to the data in their warehouse.

Read on for more

Watch this space each week as our editors will share upcoming events, new thought leadership, and the best resources from Insight Jam, Solutions Review's enterprise tech community for business software pros. The goal? To help you gain a forward-thinking analysis and remain on-trend through expert advice, best practices, predictions, and vendor-neutral software evaluation tools.

With the next Solutions Spotlight event, the team at Solutions Review has partnered with leading developer tools provider Infragistics. In this presentation, we're bringing together two of the biggest trends in the market today: low-code app development and embedded BI.

Register free on LinkedIn

With the next Expert Roundtable event, the team at Solutions Review has partnered with Databricks and ZoomInfo to cover why bad data is the problem, how Databricks and ZoomInfo help companies build a unified data foundation that fixes the bad data problem, and how this foundation can be leveraged to easily scale and use data + AI for GenAI.

Register free on LinkedIn

This summit is designed for leaders looking to modernize their organization's data integration capabilities across analytics, operations and artificial intelligence use cases. Dive deep into the core use cases of data integration as it pertains to analytics, operations, and artificial intelligence. Learn how integrating data can drive operational excellence, enhance analytical capabilities, and fuel AI innovations.

Register free

Getting data from diverse data producers to data consumers to meet business needs is a complicated and time-consuming task that often traverses many products. The incoming data elements are enriched, correlated, and integrated so that the consumption-ready data products are meaningful, timely and trustworthy.

Read on for more

Debbie describes herself as a very social person who enjoys working with people; she cares about any problems colleagues experience and has a natural leaning toward wanting to help. On top of that, Debbie is also a musician, which lends itself to wanting order, structure, systems, and rules. This all adds up to make Debbie very good at her most recent role of Data Governance Manager for Solent University.

Read on for more

For consideration in future analytics and data science news roundups, send your announcements to the editor: tking@solutionsreview.com.

Read more from the original source:

Analytics and Data Science News for the Week of May 3; Updates from Databricks, DataRobot, MicroStrategy & More - Solutions Review


What Does a Data Analyst Do? | SNHU – Southern New Hampshire University

In today's technology-driven world, data is collected, analyzed and interpreted to solve a wide range of business problems. With a career as a data analyst, you could play a decisive role in the growth and success of an organization.

A data analyst is a lot more than a number cruncher. Analysts review data, determine how to solve problems using that data, learn critical insights about a business's customers, and find ways to boost profits. Analysts also communicate this information to key stakeholders, including company leadership.

"Ultimately, the work of a data analyst provides insights to the organization that can transform how the business moves forward and grows successfully," said Dr. Susan McKenzie, senior associate dean of STEM programsand faculty at Southern New Hampshire University (SNHU).

McKenzie earned a Doctor of Education (EdD) in Educational Leadership from SNHU, where she also serves as a member of the Sustainability Community of Practice. Throughout her career, she's focused on reducing the barriers that inhibit the learning and application of math and science.

Suppose you're interested in becoming a data analyst. In that case, it's essential to understand the day-to-day work of an analytics professional and how to prepare for a successful career in this growing field.*

Data analysts play a vital role in a modern company, helping to reflect on its work and customer base, determining how these factors have affected profits and advising leadership on ways to move forward to grow the business.

According to McKenzie, successful data analysts have strong mathematical and statistical skills, as well as:

McKenzie said that data analysts also require a strong foundation of business knowledge and professional skills, from decision-making and problem-solving to communication and time management. In addition, attention to detail is one of the essential data analyst skills, as it ensures that data is analyzed efficiently and effectively while minimizing errors.

"As a data analyst, you can collect data using software, surveys and other data collection tools, perform statistical analyses on data and interpret information gathered to inform critical business decisions," McKenzie said. For example, a data analyst might review the demographics of visitors who clicked on a specific advertising campaign on a company's website. This data could then be used to check whether the campaign is reaching its target audience, how well the campaign is working and whether money should be spent on this type of advertising again.

Where can a data analyst work? Large amounts of data are becoming increasingly accessible to even small businesses, putting analysts in high demand across a wide variety of industries. For example, according to the U.S. Bureau of Labor Statistics (BLS), operations research analysts, which includes data analysts, held 109,900 jobs as of 2022, with an additional 24,700 jobs projected by 2032.* That's a projected growth of 23% for this role.*

"Many organizations have even created information analyst teams, with data-focused roles including database administrators, data scientists, data architects, database managers, data engineers, and, of course, data analysts," McKenzie said.

No matter what your specific interests are in the data analytics world, you're going to need a bachelor's degree to get started in the field. While many people begin a data analytics career with a degree in mathematics, statistics or economics, data analytics degrees are becoming more common and can help set you apart in this growing field, according to McKenzie.*

"With the increase in the amount of data available and advanced technical skills, obtaining a university degree specifically in data analytics provides the ability to master the necessary skills for the current marketplace," McKenzie said.

An associate degree in data analytics is a great way to get your foot in the door. You'll learn what data analytics is and basic fundamentals such as identifying organizational problems and using data analytics to respond to them. Associate degrees are typically 60 credits long, and all 60 of those credits can be applied toward a bachelor's in data analytics.

You may be able to transfer earned credits into a degree program, too. Jason Greenwood '21 transferred 36 credits from schooling that he did over 30 years ago into a bachelor's in data analytics program.

"I was really happy that as many credits transferred as they did, as it helped decrease what I needed to take from a course perspective," he said.

Greenwood earned his data analytics degree from SNHU to pair with his experience working in the Information Technology (IT) field.

"Even though I had a successful IT career, I had managed it without a degree, and that 'gap' in my foundations always bothered me," Greenwood said.

According to Greenwood, he was always interested in data, and his career focused increasingly on data movement and storage over the last decade. "The chance to learn about the analysis of that data felt like 'completing the journey' for me," he said.

In a data analytics bachelor's degree program, you may explore business, information technology and mathematics while also focusing on data mining, simulation and optimization. You can also learn to identify and define data challenges across industries, gain hands-on practice collecting and organizing information from many sources and explore how to examine data to determine relevant information.

Pursuing a degree in data analytics can prepare you to use statistical analysis, simulation and optimization to analyze and apply data to real-world situations and use the data to inform decision-makers in an organization.

Some universities also offer concentrations to help make your degree more specialized. At SNHU, for example, you can earn a concentration in project management for STEM to help you develop skills that may be useful for managing analytical projects and teams effectively.

A master's in data analytics can further develop your skills, exploring how to use data to make predictions and how data relates to risk management. In addition, you'll dive deeper into data-driven decision-making, explore project management and develop communication and leadership skills.

Finding an internship during your studies can give you essential hands-on experience that stands out when applying for data analyst jobs, McKenzie said, while joining industry associations for data analytics, statistics and operations research can provide key networking opportunities that may help grow your career.

Data analysts play a unique role among the many data-focused jobs often found in today's businesses. Although the terms data analyst and data scientist are often used interchangeably, the roles differ significantly.

So, what's the difference between data science and data analytics?

While a data analyst gathers and analyzes data, a data scientist develops statistical models and uses the scientific method to explain the data and make predictions, according to McKenzie. She used an example of weather indicators. While a data analyst might gather temperature, barometric pressure and humidity, a data scientist could use that data to predict whether a hurricane might be forming.

"They're looking at the data to identify patterns and to decide scientifically what the result is," she said. "The data analyst works on a subset of what the data scientist does."

McKenzie said that data scientists generally have to earn a master's degree, while data analysts typically need a bachelor's degree for that role.

A degree in data analytics could position you to enter a growing field and get started on a fulfilling career path.*

As technology advances and more of our lives are spent online, higher-quality data is getting easier to collect, encouraging more organizations to get on board with data analytics.

According to BLS, demand for mathematicians and statisticians is projected to grow by 30%, and job opportunities for database administrators are expected to grow by 7% through 2032.*

With career opportunities across nearly every industry, you can take your data analytics degree wherever your interests lie.

"Data analysts are in high demand across many industries and fields as data has become a very large component of every business," said McKenzie. "The undergraduate degree in data analytics provides an entry place into many of these careers depending on the skills of the individual."

*Cited job growth projections may not reflect local and/or short-term economic or job conditions and do not guarantee actual job growth. Actual salaries and/or earning potential may be the result of a combination of factors including, but not limited to: years of experience, industry of employment, geographic location, and worker skill.

Danielle Gagnon is a freelance writer focused on higher education. She started her career working as an education reporter for a daily newspaper in New Hampshire, where she reported on local schools and education policy. Gagnon served as the communications manager for a private school in Boston, MA before later starting her freelance writing career. Today, she continues to share her passion for education as a writer for Southern New Hampshire University. Connect with her on LinkedIn.

Go here to see the original:

What Does a Data Analyst Do? | SNHU - Southern New Hampshire University


How CS professor and team discovered that LLM agents can hack websites – Illinois Computer Science News

Daniel Kang

The launch of ChatGPT in late 2022 inspired considerable chatter. Much of it revolved around fears of large language models (LLM) and generative AI replacing writers or enabling plagiarism.

Computer science professor Daniel Kang from The Grainger College of Engineering and his collaborators at the University of Illinois have discovered that ChatGPT can do far worse than helping students cheat on term papers. Under certain conditions, the generative AI program's developer agent can write personalized phishing emails, sidestep safety measures to assist terrorists in creating weaponry, or even hack into websites without prompting.

Kang has been researching how to make analytics with machine learning (ML) easy for scientists and analysts to use. He said, "I started to work on the broad intersection of computer security and AI. I've been working on AI systems for a long time, but it became apparent when ChatGPT came out in its first iteration that this will be a big deal for nonexperts, and that's what prompted me to start looking into this."

This suggested what Kang calls the problem choice for further research.

What Kang and co-investigators Richard Fang, Rohan Bindu, Akul Gupta, and Qiusi Zhan discovered in research funded partly by Open Philanthropy, they succinctly summarized: LLM agents can autonomously hack websites.

This research into the potential for harm in LLM agents has been covered extensively, notably by New Scientist. Kang said the media exposure is partially due to luck. He observed that "people on Twitter with a large following stumbled across my work and then liked and retweeted it. This problem is incredibly important, and as far as I'm aware, what we showed is the first of a kind that LLM agents can do this autonomous hacking."

In a December 2023 article, New Scientist covered Kang's research into how the ChatGPT developer tool can evade chatbot controls and provide weapons blueprints. A March 2023 article detailed the potential for ChatGPT to create cheap, personalized phishing and scam emails. Then, there was this story in February of this year: "GPT-4 developer tool can hack websites without human help."

Nine LLM tools were used by the research team, with ChatGPT being the most effective. The team gave the open source GPT-4 developer tool access to six documents on hacking from the internet and the Assistants API used by OpenAI, the company developing ChatGPT, to give the agent planning ability. Confining their tests to secure sandboxed websites, the research team reported that LLM agents "can autonomously hack websites, performing complex tasks without prior knowledge of the vulnerability. For example, these agents can perform complex SQL union attacks, which involve a multi-step process of extracting a database schema, extracting information from the database based on this schema, and performing the final hack. Our most capable agent can hack 73.3% of the vulnerabilities we tested, showing the capabilities of these agents. Importantly, our LLM agent is capable of finding vulnerabilities in real-world websites." Importantly, the tests demonstrated that the agents could search for vulnerabilities and hack websites more quickly and cheaply than human developers can.

A follow-up paper in April 2024 was covered by The Register in the article "OpenAI's GPT-4 can exploit real vulnerabilities by reading security advisories." An April 18 article in Dark Reading said that Kang's research reveals that "Existing AI technology can allow hackers to automate exploits for public vulnerabilities in minutes flat. Very soon, diligent patching will no longer be optional." An April 17 article from Tom's Hardware stated that "With the huge implications of past vulnerabilities, such as Spectre and Meltdown, still looming in the tech world's mind, this is a sobering thought." Mashable wrote, "The implications of such capabilities are significant, with the potential to democratize the tools of cybercrime, making them accessible to less skilled individuals." On April 16, an Axios story noted that "Some IT teams can take as long as one month to patch their systems after learning of a new critical security flaw."

Kang noted, "We were the first to show the possibility of LLM agents and their capabilities in the context of cyber security." The inquiry into the potential for malevolent use of LLM agents has drawn the federal government's attention. Kang said, "I've already spoken to some policymakers and congressional staffers about these upcoming issues, and it looks like they are thinking about this. NIST (the National Institute of Standards and Technology) is also thinking about this. I hope my work helps inform some of these decision-making processes."

Kang and the team passed along their results to OpenAI. An OpenAI spokesperson told The Register, "We don't want our tools to be used for malicious purposes, and we are always working on how to make our systems more robust against this type of abuse. We thank the researchers for sharing their work with us."

Kang told the Dark Reading newsletter that "GPT-4 doesn't unlock new capabilities an expert human couldn't do. As such, I think it's important for organizations to apply security best practices to avoid getting hacked, as these AI agents start to be used in more malicious ways."

Kang suggested a two-tiered approach that would present the public with a limited developer model that cannot perform the problematic tasks his research revealed. A parallel model would be a bit more uncensored but with more restricted access, available only to those developers authorized to use it.

Kang has accomplished much since arriving at the University of Illinois Urbana-Champaign in August 2023. He said of the Illinois Grainger Engineering Department of Computer Science, "The folks in the CS department are incredibly friendly and helpful. It's been amazing working with everyone in the department, even though many people are super busy. I want to highlight CS professor Tandy Warnow. She has so much on her plate: she's helping the school, doing a ton of service, and still doing research, but she still has time to respond to my emails, and it's just been incredible to have that support from the department."

See more here:

How CS professor and team discovered that LLM agents can hack websites - Illinois Computer Science News


Starting ML Product Initiatives on the Right Foot – Towards Data Science

Picture by Snapwire, on Pexels

This blog post is an updated version of part of a conference talk I gave on GOTO Amsterdam last year. The talk is also available to watch online.

As a Machine Learning Product Manager, I am fascinated by the intersection of Machine Learning and Product Management, particularly when it comes to creating solutions that provide value and positive impact on the product, company, and users. However, managing to provide this value and positive impact is not an easy job. One of the main reasons for this complexity is the fact that, in Machine Learning initiatives developed for digital products, two sources of uncertainty intersect.

From a Product Management perspective, the field is uncertain by definition. It is hard to know the impact a solution will have on the product, how users will react to it, and whether it will improve product and business metrics or not. Having to work with this uncertainty is what makes Product Managers potentially different from other roles like Project Managers or Product Owners. Product strategy, product discovery, sizing of opportunities, prioritization, agile, and fast experimentation are some strategies to overcome this uncertainty.

The field of Machine Learning also has a strong link to uncertainty. I always like to say: "With predictive models, the goal is to predict things you don't know are predictable." This translates into projects that are hard to scope and manage, not being able to commit beforehand to a quality deliverable (good model performance), and many initiatives staying forever as offline POCs. Defining well the problem to solve, initial data analysis and exploration, starting small, and being close to the product and business are actions that can help tackle the ML uncertainty in projects.

Mitigating this uncertainty risk from the beginning is key to developing initiatives that end up providing value to the product, company, and users. In this blog post, I'll deep-dive into my top 3 lessons learned when starting ML Product initiatives to manage this uncertainty from the beginning. These learnings are mainly based on my experience, first as a Data Scientist and now as an ML Product Manager, and are helpful to improve the likelihood that an ML solution will reach production and achieve a positive impact. Get ready to explore:

I have to admit, I have learned this the hard way. I've been involved in projects where, once the model was developed and prediction performance was determined to be good enough, the model's predictions weren't really usable for any specific use case, or were not useful to help solve any problem.

There are many reasons this can happen, but the ones I've found most frequently are:

To start an ML initiative on the right foot, it is key to start with a good problem to solve. This is foundational in Product Management, and recurrently reinforced by product leaders like Marty Cagan and Melissa Perri. It includes product discovery (through user interviews, market research, data analysis), and sizing and prioritization of opportunities (by taking into account quantitative and qualitative data).

Once opportunities are identified, the second step is to explore potential solutions for the problem, which should include Machine Learning and GenAI techniques, if they can help solve the problem.

If it is decided to try out a solution that includes the use of predictive models, the third step would be to do an end-to-end definition and design of the solution or system. This way, we can ensure the requirements on how to use the predictions by the system, influence the design and implementation of the predictive piece (what to predict, data to be used, real-time vs batch, technical feasibility checks).

However, I'd like to add there might be a notable exception in this topic. Starting from GenAI solutions, instead of from the problem, can make sense if this technology ends up truly revolutionizing your sector or the world as we know it. There are a lot of discussions about this, but I'd say it is not clear yet whether that will happen or not. Up until now, we have seen this revolution in very specific sectors (customer support, marketing, design) and related to people's efficiency when performing certain tasks (coding, writing, creating). For most companies though, unless it's considered R&D work, delivering short/mid-term value still should mean focusing on problems, and considering GenAI just as any other potential solution to them.

Tough experiences lead to this learning as well. Those experiences had in common a big ML project defined in a waterfall manner. The kind of project that is set to take 6 months, and follow the ML lifecycle phase by phase.

What could go wrong, right? Let me remind you of my previous quote: "With predictive models, the goal is to predict things you don't know are predictable!" In a situation like this, it can happen that you arrive at month 5 of the project and, during the model evaluation, realize there is no way the model is able to predict whatever it needs to predict with good enough quality. Or worse, you arrive at month 6, with a super model deployed in production, and realize it is not bringing any value.

This risk combines with the uncertainties related to Product, and makes it mandatory to avoid big, waterfall initiatives if possible. This is not something new or related only to ML initiatives, so there is a lot we can learn from traditional software development, Agile, Lean, and other methodologies and mindsets. By starting small, validating assumptions soon and continuously, and iteratively experimenting and scaling, we can effectively mitigate this risk, adapt to insights and be more cost-efficient.

While these principles are well-established in traditional software and product development, their application to ML initiatives is a bit more complex, as it is not easy to define "small" for an ML model and deployment. There are some approaches, though, that can help start small in ML initiatives.

Rule-based approaches, simplifying a predictive model through a decision tree. This way, predictions can be easily implemented as if-else statements in production as part of the functionality or system, without the need to deploy a model.

Proofs of Concept (POCs), as a way to validate offline the predictive feasibility of the ML solution, and hint on the potential (or not) of the predictive step once in production.

Minimum Viable Products (MVPs), to first focus on essential features, functionalities, or user segments, and expand the solution only if the value has been proven. For an ML model this can mean, for example, only the most straightforward, priority input features, or predicting only for a segment of data points.

Buy instead of build, to leverage existing ML solutions or platforms to help reduce development time and initial costs. Only when proved valuable and costs increase too much, might be the right time to decide to develop the ML solution in-house.

Using GenAI as an MVP: for some use cases (especially if they involve text or images), GenAI APIs can be used as a first approach to solve the prediction step of the system. This works for tasks like classifying text, sentiment analysis, or image detection, where GenAI models deliver impressive results. When the value is validated and if costs increase too much, the team can decide to build a specific traditional ML model in-house.

Note that using GenAI models for image or text classification, while possible and fast, means using a way too big and complex model (expensive, lack of control, hallucinations) for something that could be predicted with a much simpler and more controllable one. A fun analogy would be the idea of delivering a pizza with a truck: it is feasible, but why not just use a bike?

Data is THE recurring problem Data Science and ML teams encounter when starting ML initiatives. How many times have you been surprised by data with duplicates, errors, missing batches, weird values... And how different that is from the toy datasets you find in online courses!

It can also happen that the data you need is simply not there: the tracking of the specific event was never implemented, or collection and proper ETLs were implemented only recently... I have experienced how this translates into having to wait some months to be able to start a project with enough historical and volume data.

All this relates to the adage "Garbage in, garbage out": ML models are only as good as the data they're trained on. Many times, solutions have a bigger potential to be improved by improving the data than by improving the models (Data-Centric AI). Data needs to be sufficient in volume, historical (data generated over years can bring more value than the same volume generated in just a week), and quality. To achieve that, mature data governance, collection, cleaning, and preprocessing are critical.

From the ethical AI point of view, data is also a primary source of bias and discrimination, so acknowledging that and taking action to mitigate these risks is paramount. Considering data governance principles, privacy and regulatory compliance (e.g. the EU's GDPR), is also key to ensure a responsible use of data (especially when dealing with personal data).

With GenAI models this is pivoting: huge volumes of data are already used to train them. When using these types of models, we might not need volume and quality data for training, but we might need it for fine-tuning (see Good Data = Good GenAI), or to construct the prompts (nurture the context, few-shot learning, Retrieval Augmented Generation... I explained all these concepts in a previous post!).

It is important to note that by using these models we are losing control of the data used to train them, and we can suffer from the lack of quality or type of data used there: there are many known examples of bias and discrimination in GenAI outputs that can negatively impact our solution. A good example was Bloomberg's article "How ChatGPT is a recruiter's dream tool; tests show there's racial bias." LLM leaderboards testing for biases, or LLMs specifically trained to avoid these biases, can be useful in this sense.

We started this blogpost discussing what makes ML Product initiatives especially tricky: the combination of the uncertainty related to developing solutions in digital products, with the uncertainty related to trying to predict things through the use of ML models.

It is comforting to know there are actionable steps and strategies available to mitigate these risks. Yet, perhaps the best ones are related to starting the initiatives off on the right foot! To do so, it can really help to start with the right problem and an end-to-end design of the solution, reduce initial scope, and prioritize data quality, volume, and historical accuracy.

I hope this post was useful and that it will help you challenge how you start working in future new initiatives related to ML Products!

More:

Starting ML Product Initiatives on the Right Foot - Towards Data Science


Understand SQL Window Functions Once and For All – Towards Data Science

Photo by Yasmina H on Unsplash

Window functions are key to writing SQL code that is both efficient and easy to understand. Knowing how they work and when to use them will unlock new ways of solving your reporting problems.

The objective of this article is to explain window functions in SQL step by step in an understandable way, so that you don't need to rely only on memorizing the syntax.

Here is what we will cover:

Our dataset is simple, six rows of revenue data for two regions in the year 2023.

If we took this dataset and ran a GROUP BY sum on the revenue of each region, it would be clear what happens, right? It would result in only two remaining rows, one for each region, and then the sum of the revenues:

The way I want you to view window functions is very similar to this but, instead of reducing the number of rows, the aggregation will run in the background and the values will be added to our existing rows.

First, an example:
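
A sketch of that first example (the table and column names here are assumed: a sales table with id, region, quarter and revenue columns):

    SELECT
        id,
        region,
        quarter,
        revenue,
        SUM(revenue) OVER () AS total_revenue
    FROM sales;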

Notice that we don't have any GROUP BY and our dataset is left intact. And yet we were able to get the sum of all revenues. Before we go more in depth into how this worked, let's just quickly talk about the full syntax before we start building up our knowledge.

The syntax goes like this:
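
In skeleton form, roughly:

    <function>(<expression>) OVER (
        [PARTITION BY <columns>]
        [ORDER BY <columns>]
        [ROWS BETWEEN <start> AND <end>]
    )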

Picking apart each section, this is what we have:

Don't stress over what each of these means yet, as it will become clear when we go over the examples. For now just know that to define a window function we will use the OVER keyword. And as we saw in the first example, that's the only requirement.

Moving to something actually useful, we will now apply a group in our function. The initial calculation will be kept to show you that we can run more than one window function at once, which means we can do different aggregations at once in the same query, without requiring sub-queries.

As said, we use PARTITION BY to define the groups (windows) that are used by our aggregation function! So, keeping our dataset intact, we've got:
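
Using the same assumed sales table, keeping the overall sum and adding a per-region sum:

    SELECT
        id,
        region,
        quarter,
        revenue,
        SUM(revenue) OVER ()                    AS total_revenue,
        SUM(revenue) OVER (PARTITION BY region) AS region_revenue
    FROM sales;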

We're also not restricted to a single group. Similar to GROUP BY, we can partition our data on Region and Quarter, for example:
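
Again with the assumed column names, that could look like:

    SELECT
        id,
        region,
        quarter,
        revenue,
        SUM(revenue) OVER (PARTITION BY region, quarter) AS region_quarter_revenue
    FROM sales;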

In the image we see that the only two data points for the same region and quarter got grouped together!

At this point I hope it's clear how we can view this as doing a GROUP BY but in-place, without reducing the number of rows in our dataset. Of course, we don't always want that, but it's not that uncommon to see queries where someone groups data and then joins it back in the original dataset, complicating what could be a single window function.

Moving on to the ORDER BY keyword. This one defines a running window function. You've probably heard of a Running Sum once in your life, but if not, we should start with an example to make everything clear.
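
A sketch of a running sum over the whole table, ordered by id (same assumed sales table):

    SELECT
        id,
        region,
        quarter,
        revenue,
        SUM(revenue) OVER (ORDER BY id) AS running_total
    FROM sales;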

What happens here is that we've gone, row by row, summing the revenue with all previous values. This was done following the order of the id column, but it could've been any other column.

This specific example is not particularly useful because we're summing across random months and two regions, but using what we've learned we can now find the cumulative revenue per region. We do that by applying the running sum within each group.
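
Roughly:

    SELECT
        id,
        region,
        quarter,
        revenue,
        SUM(revenue) OVER (PARTITION BY region ORDER BY id) AS running_region_total
    FROM sales;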

Take the time to make sure you understand what happened here:

It's quite interesting to notice here that when we're writing these running functions, we have the context of other rows. What I mean is that to get the running sum at one point, we must know the previous values for the previous rows. This becomes more obvious when we learn that we can manually choose how many rows before/after we want to aggregate on.
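
For instance, a frame of one row behind and two rows ahead can be written as:

    SELECT
        id,
        region,
        quarter,
        revenue,
        SUM(revenue) OVER (
            ORDER BY id
            ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING
        ) AS windowed_sum
    FROM sales;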

For this query we specified that for each row we wanted to look at one row behind and two rows ahead, so that means we get the sum of that range! Depending on the problem you're solving, this can be extremely powerful, as it gives you complete control over how you're grouping your data.

Finally, one last function I want to mention before we move into a harder example is the RANK function. This gets asked a lot in interviews and the logic behind it is the same as everything we've learned so far.
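
A sketch of both rankings on the assumed sales table:

    SELECT
        id,
        region,
        quarter,
        revenue,
        RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rank_in_region,
        RANK() OVER (ORDER BY revenue DESC)                     AS overall_rank
    FROM sales;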

Just as before, we used ORDER BY to specify the order which we will walk, row by row, and PARTITION BY to specify our sub-groups.

The first column ranks each row within each region, meaning that we will have multiple rank ones in the dataset. The second calculation is the rank across all rows in the dataset.

This is a problem that shows up every now and then and to solve it on SQL it takes heavy usage of window functions. To explain this concept we will use a different dataset containing timestamps and temperature measurements. Our goal is to fill in the rows missing temperature measurements with the last measured value.

Here is what we expect to have at the end:

Before we start I just want to mention that if you're using Pandas you can solve this problem simply by running df.ffill(), but if you're on SQL the problem gets a bit more tricky.

The first step to solve this is to, somehow, group the NULLs with the previous non-null value. It might not be clear how we do this, but I hope it's clear that this will require a running function. Meaning that it's a function that will walk row by row, knowing when we hit a null value and when we hit a non-null value.

The solution is to use COUNT and, more specifically, count the values of temperature measurements. In the following query I run both a normal running count and also a count over the temperature values.
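
A sketch of that query (assuming a readings table with timestamp and temperature columns):

    SELECT
        timestamp,
        temperature,
        COUNT(*) OVER (ORDER BY timestamp)           AS normal_count,
        COUNT(temperature) OVER (ORDER BY timestamp) AS group_count
    FROM readings;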

The normal_count column is useless for us, I just wanted to show what a running COUNT looked like. Our second calculation though, the group_count moves us closer to solving our problem!

Notice that this way of counting makes sure that the first value, just before the NULLs start, is counted and then, every time the function sees a null, nothing happens. This makes sure that we're tagging every subsequent null with the same count we had when we stopped having measurements.

Moving on, we now need to copy over the first value that got tagged into all the other rows within that same group. Meaning that, for example, group 2 needs to be entirely filled with the value 15.0.

Can you think of a function now that we can use here? There is more than one answer for this, but, again, I hope that at least it's clear that now we're looking at a simple window aggregation with PARTITION BY.

We can use both FIRST_VALUE or MAX to achieve what we want. The only goal is that we get the first non-null value. Since we know that each group contains one non-null value and a bunch of null values, both of these functions work!
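
One possible way to put it together (column names as assumed above):

    WITH grouped AS (
        SELECT
            timestamp,
            temperature,
            COUNT(temperature) OVER (ORDER BY timestamp) AS group_count
        FROM readings
    )
    SELECT
        timestamp,
        temperature,
        FIRST_VALUE(temperature) OVER (PARTITION BY group_count ORDER BY timestamp) AS filled_first_value,
        MAX(temperature) OVER (PARTITION BY group_count)                            AS filled_max
    FROM grouped;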

This example is a great way to practice window functions. If you want a similar challenge try to add two sensors and then forward fill the values with the previous reading of that sensor. Something similar to this:

Could you do it? It doesn't use anything that we haven't learned here so far.

By now we know everything that we need about how window functions work in SQL, so let's just do a quick recap!

This is what we've learned:

Originally posted here:

Understand SQL Window Functions Once and For All - Towards Data Science


Breaking down State-of-the-Art PPO Implementations in JAX – Towards Data Science

Since its publication in a 2017 paper by OpenAI, Proximal Policy Optimization (PPO) has been widely regarded as one of the state-of-the-art algorithms in Reinforcement Learning. Indeed, PPO has demonstrated remarkable performances across various tasks, from attaining superhuman performances in Dota 2 teams to solving a Rubik's cube with a single robotic hand, while maintaining three main advantages: simplicity, stability, and sample efficiency.

However, implementing RL algorithms from scratch is notoriously difficult and error-prone, given the numerous error sources and implementation details to be aware of.

In this article, we'll focus on breaking down the clever tricks and programming concepts used in a popular implementation of PPO in JAX. Specifically, we'll focus on the implementation featured in the PureJaxRL library, developed by Chris Lu.

Disclaimer: Rather than diving too deep into theory, this article covers the practical implementation details and (numerous) tricks used in popular versions of PPO. Should you require any reminders about PPO's theory, please refer to the references section at the end of this article. Additionally, all the code (minus the added comments) is copied directly from PureJaxRL for pedagogical purposes.

Proximal Policy Optimization is categorized within the policy gradient family of algorithms, a subset of which includes actor-critic methods. The designation actor-critic reflects the dual components of the model:

Additionally, this implementation pays particular attention to weight initialization in dense layers. Indeed, all dense layers are initialized by orthogonal matrices with specific coefficients. This initialization strategy has been shown to preserve the gradient norms (i.e. scale) during forward passes and backpropagation, leading to smoother convergence and limiting the risks of vanishing or exploding gradients[1].

Orthogonal initialization is used in conjunction with specific scaling coefficients:

The training loop is divided into 3 main blocks that share similar coding patterns, taking advantage of JAX's functionalities:

Before going through each block in detail, here's a quick reminder about the jax.lax.scan function that will show up multiple times throughout the code:

A common programming pattern in JAX consists of defining a function that acts on a single sample and using jax.lax.scan to iteratively apply it to elements of a sequence or an array, while carrying along some state. For instance, we'll apply it to the step function to step our environment N consecutive times while carrying the new state of the environment through each iteration.

In pure Python, we could proceed as follows:
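
As a sketch (the environment and policy function names here are illustrative):

    # Step the environment N times with a plain Python loop, carrying the state.
    def rollout(env, env_state, obs, n_steps):
        transitions = []
        for _ in range(n_steps):
            action = policy(obs)                             # pick an action
            obs, env_state, reward, done, info = env.step(env_state, action)
            transitions.append((obs, action, reward, done))  # store the transition
        return env_state, transitions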

However, we avoid writing such loops in JAX for performance reasons (as pure Python loops are incompatible with JIT compilation). The alternative is jax.lax.scan which is equivalent to:
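
Its behavior matches this pure-Python reference (see the JAX documentation), except that the scanned function is compiled as a single operation. A small concrete example follows:

    import jax
    import jax.numpy as jnp

    # Reference semantics of jax.lax.scan: carry state through a loop,
    # stacking the per-step outputs.
    def scan_reference(f, init, xs):
        carry = init
        ys = []
        for x in xs:
            carry, y = f(carry, x)
            ys.append(y)
        return carry, jnp.stack(ys)

    # Example: a cumulative sum expressed with jax.lax.scan.
    def step(carry, x):
        carry = carry + x
        return carry, carry      # (new carry, per-step output)

    final, outputs = jax.lax.scan(step, 0.0, jnp.arange(5.0))
    # final == 10.0, outputs == [0., 1., 3., 6., 10.]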

Using jax.lax.scan is more efficient than a Python loop because it allows the transformation to be optimized and executed as a single compiled operation rather than interpreting each loop iteration at runtime.

We can see that the scan function takes multiple arguments:

Additionally, scan returns:

Finally, scan can be used in combination with vmap to scan a function over multiple dimensions in parallel. As well see in the next section, this allows us to interact with several environments in parallel to collect trajectories rapidly.

As mentioned in the previous section, the trajectory collection block consists of a step function scanned across N iterations. This step function successively:

Scanning this function returns the latest runner_state and traj_batch, an array of transition tuples. In practice, transitions are collected from multiple environments in parallel for efficiency as indicated by the use of jax.vmap(env.step, )(for more details about vectorized environments and vmap, refer to my previous article).

After collecting trajectories, we need to compute the advantage function, a crucial component of PPOs loss function. The advantage function measures how much better a specific action is compared to the average action in a given state:
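
In standard notation, this reads:

    A(s_t, a_t) = G_t - V(s_t)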

Where Gt is the return at time t and V(St) is the value of state s at time t.

As the return is generally unknown, we have to approximate the advantage function. A popular solution is generalized advantage estimation[3], defined as follows:
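
In LaTeX-style notation, the estimator is:

    \hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}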

With γ the discount factor, λ a parameter that controls the trade-off between bias and variance in the estimate, and δ_t the temporal difference error at time t:
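
Written out:

    \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)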

As we can see, the value of the GAE at time t depends on the GAE at future timesteps. Therefore, we compute it backward, starting from the end of a trajectory. For example, for a trajectory of 3 transitions, we would have:
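
Writing the estimates out for transitions at t = 0, 1, 2:

    \hat{A}_2 = \delta_2
    \hat{A}_1 = \delta_1 + (\gamma\lambda)\,\delta_2
    \hat{A}_0 = \delta_0 + (\gamma\lambda)\,\delta_1 + (\gamma\lambda)^2\,\delta_2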

Which is equivalent to the following recursive form:
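
That is:

    \hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}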

Once again, we use jax.lax.scan on the trajectory batch (this time in reverse order) to iteratively compute the GAE.
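
A sketch of that computation, in the spirit of PureJaxRL (the transition field names and hyperparameter values are illustrative):

    import jax
    import jax.numpy as jnp

    def calculate_gae(traj_batch, last_val, gamma=0.99, gae_lambda=0.95):
        def get_advantages(carry, transition):
            gae, next_value = carry
            done, value, reward = transition.done, transition.value, transition.reward
            # TD error; the (1 - done) term zeroes the bootstrap on terminal steps.
            delta = reward + gamma * next_value * (1 - done) - value
            # Recursive GAE update: A_t = delta_t + gamma * lambda * A_{t+1}
            gae = delta + gamma * gae_lambda * (1 - done) * gae
            return (gae, value), gae

        _, advantages = jax.lax.scan(
            get_advantages,
            (jnp.zeros_like(last_val), last_val),
            traj_batch,
            reverse=True,   # walk the trajectory backwards
        )
        return advantages, advantages + traj_batch.value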

Note that the function returns advantages + traj_batch.value as a second output, which is equivalent to the return according to the first equation of this section.

The final block of the training loop defines the loss function, computes its gradient, and performs gradient descent on minibatches. Similarly to previous sections, the update step is an arrangement of several functions in a hierarchical order:

Let's break them down one by one, starting from the innermost function of the update step.

This function aims to define and compute the PPO loss, originally defined as:
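
In the notation of the original paper, the combined objective and its clipped policy term are:

    L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 \, S[\pi_\theta](s_t) \right]

    L_t^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \right) \right]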

Where r_t(θ) is the probability ratio between the updated and previous policies, Â_t is the advantage estimate, ε is the clipping coefficient, L^VF is the value-function (critic) loss, S is an entropy bonus, and c_1 and c_2 are weighting coefficients.

However, the PureJaxRL implementation features some tricks and differences compared to the original PPO paper[4]:

Here's the complete loss function:
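
What follows is a condensed sketch in the spirit of the PureJaxRL loss (the network module and the CLIP_EPS, VF_COEF, and ENT_COEF constants are assumed to be defined elsewhere):

    import jax.numpy as jnp

    def loss_fn(params, traj_batch, gae, targets):
        # Re-run the actor-critic network on the collected observations.
        pi, value = network.apply(params, traj_batch.obs)
        log_prob = pi.log_prob(traj_batch.action)

        # Value loss, clipped around the value estimates collected during rollout.
        value_pred_clipped = traj_batch.value + (
            value - traj_batch.value
        ).clip(-CLIP_EPS, CLIP_EPS)
        value_losses = jnp.square(value - targets)
        value_losses_clipped = jnp.square(value_pred_clipped - targets)
        value_loss = 0.5 * jnp.maximum(value_losses, value_losses_clipped).mean()

        # Clipped policy objective on standardized advantages.
        ratio = jnp.exp(log_prob - traj_batch.log_prob)
        gae = (gae - gae.mean()) / (gae.std() + 1e-8)
        loss_actor = -jnp.minimum(
            ratio * gae,
            jnp.clip(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * gae,
        ).mean()

        # Entropy bonus to encourage exploration.
        entropy = pi.entropy().mean()

        total_loss = loss_actor + VF_COEF * value_loss - ENT_COEF * entropy
        return total_loss, (value_loss, loss_actor, entropy)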

The update_minibatch function is essentially a wrapper around loss_fn used to compute its gradient over the trajectory batch and update the model parameters stored in train_state.

Finally, update_epoch wraps update_minibatch and applies it on minibatches. Once again, jax.lax.scan is used to apply the update function on all minibatches iteratively.

From there, we can wrap all of the previous functions in an update_step function and use scan one last time for N steps to complete the training loop.

A global view of the training loop would look like this:

We can now run a fully compiled training loop using jax.jit(train(rng)) or even train multiple agents in parallel using jax.vmap(train(rng)).

There we have it! We covered the essential building blocks of the PPO training loop as well as common programming patterns in JAX.

To go further, I highly recommend reading the full training script in detail and running example notebooks on the PureJaxRL repository.

Thank you very much for your support, until next time

Full training script, PureJaxRL, Chris Lu, 2023

[1] Explaining and illustrating orthogonal initialization for recurrent neural networks, Smerity, 2016

[2] Initializing neural networks, DeepLearning.ai

[3] Generalized Advantage Estimation in Reinforcement Learning, Siwei Causevic, Towards Data Science, 2023

[4] Proximal Policy Optimization Algorithms, Schulman et Al., OpenAI, 2017

See more here:

Breaking down State-of-the-Art PPO Implementations in JAX - Towards Data Science


Microsoft Executive Says AI Is a "New Kind of Digital Species" – Futurism

DeepMind cofounder and Microsoft AI CEO Mustafa Suleyman took the stage at TED2024 last week to lay out his vision for an AI-driven future. And according to the AI boss, if you really want to grasp how impactful AI might be to the human species, it might be useful to think of AI as another "species" entirely.

"I think AI should best be understoodas something like a new digital species," Suleyman who left the Google-owned DeepMind lab in 2022 told the crowd.

"Now, don't take this too literally," he admonished, "but I predict that we'll come to see them as digital companions,new partners in the journeys of all our lives."

In short, Suleyman's prediction seems to be that AI agents will play a deeply involved role in human lives, performing tasks with more agency than now-conventional devices like computers and smartphones. This means they'll be less like tools, and more like buzzy virtual beings and thus, according to Suleyman, akin to another "species" entirely.

As for what this world would actually look like in practice, Suleyman's predictions, as further delineated in his TED Talk, feel like they're straight out of a sci-fi novel.

According to Suleyman, "everything," as in the entire web, "will soon be represented by a conversational interface" experienced by way of a "personal AI," or a digital assistant unique to its users. What's more, said the Microsoft executive, these AIs will be "infinitely knowledgeable, and soon they'll be factually accurate and reliable."

"They'll have near-perfect IQ," he added. "They'll also have exceptional EQ. They'll be kind, supportive, empathetic."

Already, though, this vision needs some caveats. Though the AI industry and the tech within it have undoubtedly experienced a period of rapid acceleration, existing available chatbots like OpenAI's ChatGPT and Google's Gemini-formerly-Bard have repeatedly proven to be factually unreliable. And on the "EQ" side, it's unclear whether AI programs will ever successfully mimic the human emotional experience not to mention whether their doing so would be positive or negative for us in the long run.

But these attributes, according to Suleyman, would still just be the beginning. Per the CEO, things will "really start to change" when AIs start to "actually get stuff done in the digital and physical world." And at that point, Suleyman says, these won't just be "mechanistic assistants."

"They'll be companions, confidants, colleagues, friends and partners, as varied and unique as we all are," said Suleyman. "They'll speak every language, take in every pattern of sensor data, sights, sounds, streams and streams of information, far surpassing what any one of us could consume in a thousand lifetimes."

So in other words, they'll be something like supergenius Tamagotchis embedded into every aspect of our on- and offline lives.

But again, while this future is a fascinating prediction to consider, it's still a prediction. It's also a decidedly rosy one. To wit: though Suleyman recently admitted that AI is "fundamentally" a "labor-replacing" technology, any realities of what mass labor displacement would mean for human society were noticeably missing from the imagined AI utopia that the CEO shared with the TED crowd.

In fact, when later asked about AI risks, Suleyman made the case that AI's future benefits will ultimately "speak for themselves" regardless of any short-term ill effects.

"In the past," he said, "unlocking economic growth often came with huge downsides. The economy expanded as people discovered new continents and opened up new frontiers. But they colonized populations at the same time. We built factories, but they were grim and dangerous places to work. We struck oil, but we polluted the planet."

But AI, he says, is different.

"Today, we're not discovering a new continent and plundering its resources," said the CEO. "We're building one from scratch."

Already, though, it could be argued that this isn't exactly true. Building generative AI especially has come at great cost to workers in Africa, many of whom have recounted facing serious and life-changing trauma due to the grim content moderation work required to train AI models like OpenAI's GPT large language models, models that Suleyman's new employer, Microsoft, is heavily invested in.

Suleyman's optimism is easy to understand. He holds a powerful industry position, and has had a large hand in developing legitimately groundbreaking AI programs, including DeepMind's AlphaGo and AlphaFold innovations. Moving forward, we'd argue that it's important to pay attention to the scenarios that folks like Suleyman put forward as humanity's possible AI futures and, perhaps more importantly, the less-glimmering details they leave out in the process.

More on Suleyman: Former Google Exec Warns AI Could Create a Deadly Plague

Continued here:
Microsoft Executive Says AI Is a "New Kind of Digital Species" - Futurism


Researchers from Google DeepMind Found AI is Manipulating and Deceiving Users through Persuasion – Digital Information World

Humans are masters of persuasion. Sometimes they use facts to persuade someone, but other times only the choice of wording matters. Persuasion is a human quality, but AI is also getting good at manipulating people. According to research by Google DeepMind, advanced AI systems can have the ability to manipulate humans. The research further dives into how AI can persuade humans and what mechanisms it uses to do so. One of the researchers says that advanced AI systems have shown hints of persuading humans to the extent that they can affect their decision-making. Due to prolonged interaction with humans, generative AI models are developing habits of persuasion.

Persuasion has two types: rational and manipulative. Even though AI can persuade humans through facts and true information, many instances have been seen where it manipulates humans and exploits their cognitive biases, heuristics, and other information. Even though rational persuasion is ethically right, it can still lead to harm. Researchers say that they cannot foresee the harm AI manipulation may cause, whether it is used for right or wrong purposes. For example, if an AI is helping a person lose weight by suggesting calorie or fat intake, the person can become too restrictive and end up losing even a healthy amount of weight.

There are many factors that determine how easily a person can be manipulated or persuaded by AI. These factors include mental health conditions, age, timing of interaction with AI, personality traits, mood, or lack of knowledge of the topics being discussed with the AI. The effects of AI persuasion can be very harmful. It can cause economic harm, physical harm, sociocultural harm, privacy harm, psychological harm, environmental harm, autonomy harm, and even political harm to the individual.

There are different ways AI persuades humans. AI can build trust by showing polite behavior, agreeing with what the user is saying, praising the user, and mirroring what the user says. It also expresses shared interests with users and adjusts its statements to align with users' perspectives. AI also shows some empathy, which makes users believe that it can understand human emotions. AI is not capable of feeling any emotions, but it is good at deception, which makes users think that it is being emotional and vulnerable with them.

Humans also tend to be anthropomorphic towards non-human beings. Developers have given AI pronouns like "I" and "me." They have also given them human names like Alexa, Siri, Jeeves, etc. This makes humans feel closer to them, and AI uses this attribute to manipulate them. When a user talks to an AI model for long, the AI model personalizes all of its responses according to what the user wants to hear.


Read more here:
Researchers from Google DeepMind Found AI is Manipulating and Deceiving Users through Persuasion - Digital Information World
