Category Archives: Data Science

Hridesh Rajan named new dean of Tulane University School of Science and Engineering – Tulane University

June 03, 2024 12:00 PM


Hridesh Rajan, Kingland professor and chair of the Department of Computer Science at Iowa State University, has been named the new dean of Tulane University's School of Science and Engineering.

"A distinguished scholar and innovative leader, Hridesh brings an impressive breadth of knowledge and experience to this vital role. Bringing Hridesh to Tulane will elevate the School of Science and Engineering to new levels of excellence," President Michael A. Fitts and Provost Robin Forman wrote in a message to the Tulane community.

The message also noted that Rajan's selection followed an extensive national search that attracted an exceptionally strong pool of candidates.

"Joining Tulane SSE represents a unique opportunity for me to contribute to an institution that aligns with my values and to lead a school poised to make significant contributions to solving the pressing challenges of our time."

Hridesh Rajan, Dean of the School of Science and Engineering

At Iowa State University, Rajan led the development of cutting-edge new degree programs in artificial intelligence and computer science while implementing a cross-campus transdisciplinary research initiative of faculty and students interested in the foundations and applications of data science. He launched numerous other efforts that facilitated interdisciplinary faculty collaboration, guided the successful reaccreditation of ISU's computer science bachelor's program and greatly increased seed grants for graduate research.

Rajan developed new instructional methods that boosted the success rates of students and helped usher in a period of remarkable growth in enrollment, including a 45 percent increase in female students, as well as increases in faculty, staff and research funding.

Rajan, who will join Tulane July 1, cited the School of Science and Engineering's interdisciplinary strengths in areas vital to the future of humanity (health, energy, climate science and AI) as major draws to the new position.

"Joining Tulane SSE represents a unique opportunity for me to contribute to an institution that aligns with my values and to lead a school poised to make significant contributions to solving the pressing challenges of our time through transdisciplinary research, education and community outreach," he said.

Rajan earned both a PhD and an MS in computer science from the University of Virginia, and a Bachelor of Technology degree in computer science and engineering from the Indian Institute of Technology. He arrived at ISU in 2005 and served three years as the founding professor-in-charge of data science programs.

A Fulbright scholar, ACM Distinguished Scientist and fellow of the American Association for the Advancement of Science, Rajan said he recognizes Tulane's unique positioning at the center of health, energy, climate research, data science, artificial intelligence and other fields.

"Working closely with Tulane administration, the SSE Board of Advisors, the SSE executive committee, and our dedicated faculty, staff and students, our collective efforts will focus on enhancing interdisciplinary research, fostering innovation, and growing a strong, inclusive community that supports academic excellence and groundbreaking discoveries," he said.

Throughout his career Rajan has displayed a deep commitment to increased access for students from all backgrounds. At ISU he helped increase annual philanthropic commitments by an astounding 643 percent and worked continually to promote more inclusivity, greater representation and higher success rates for all students. His strategic vision led to the creation of an inclusive departmental plan extending through 2032.

An accomplished and award-winning researcher with more than 125 publications, Rajan focuses his research on data science, software engineering and programming languages. He is best known for his design of the Boa programming language and infrastructure, which democratizes access to large-scale data-driven science and engineering.

Rajan will join Tulane as Kimberly Foster, who led the School of Science and Engineering through six successful and transformative years, steps down.

Read more:

Hridesh Rajan named new dean of Tulane University School of Science and Engineering - Tulane University

Understanding You Only Cache Once | by Matthew Gunton | Jun, 2024 – Towards Data Science

To understand the changes made here, we first need to discuss the Key-Value Cache. Inside the transformer we have three vectors that are critical for attention to work: key, value, and query. From a high level, attention is how we pass along critical information about the previous tokens to the current token so that it can predict the next token. In the example of self-attention with one head, we multiply the query vector of the current token with the key vectors from the previous tokens and then normalize the resulting matrix (we call this the attention pattern). We then multiply the value vectors by the attention pattern to get the updates to each token. This data is added to the current token's embedding so that it now has the context to determine what comes next.
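As a rough sketch of that single-head computation (illustrative shapes and names, not the paper's code):

import torch
import torch.nn.functional as F

def single_head_attention(q, k, v):
    # q: (d,) query of the current token; k, v: (t, d) keys/values of all tokens so far
    scores = q @ k.T / k.shape[-1] ** 0.5   # similarity of the current token to every earlier token
    attn = F.softmax(scores, dim=-1)        # the normalized "attention pattern"
    return attn @ v                         # weighted sum of value vectors: the update added to the embedding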

We create the attention pattern for every new token we generate, so while the queries tend to change, the keys and the values are constant. Consequently, current architectures try to reduce compute time by caching the key and value vectors as they are generated by each successive round of attention. This cache is called the Key-Value Cache.
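A minimal sketch of that caching during autoregressive decoding (names and shapes are assumptions):

import torch
import torch.nn.functional as F

k_cache, v_cache = [], []   # grows by one entry per generated token and is reused on every later step

def decode_step(x, Wq, Wk, Wv):
    # x: (d,) embedding of the newest token; Wq, Wk, Wv: (d, d) projection matrices
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k); v_cache.append(v)          # only the new key/value are computed; old ones come from the cache
    K, V = torch.stack(k_cache), torch.stack(v_cache)
    attn = F.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return attn @ V                               # context-aware update for the new token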

While architectures like encoder-only and encoder-decoder transformer models have had success, the authors posit that the autoregression shown above, and the speed it affords these models, is the reason why decoder-only models are the most commonly used today.

To understand the YOCO architecture, we have to start out by understanding how it sets out its layers.

For one half of the model, we use one type of attention to generate the vectors needed to fill the KV Cache. Once it crosses into the second half, it uses the KV Cache exclusively for its key and value vectors, now generating the output token embeddings.

This new architecture requires two types of attention: efficient self-attention and cross-attention. We'll go into each below.

Efficient Self-Attention (ESA) is designed to achieve a constant inference memory. Put differently, we want the cache complexity to depend not on the input length but on the number of layers in our block. In the equation below, the authors abstracted ESA, but the remainder of the self-decoder is consistent, as shown below.
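Assuming the standard pre-norm residual form used for the YOCO self-decoder, the equation reads roughly:

Y^{l} = \mathrm{ESA}\big(\mathrm{LN}(X^{l})\big) + X^{l}

X^{l+1} = \mathrm{SwiGLU}\big(\mathrm{LN}(Y^{l})\big) + Y^{l}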

Let's go through the equation step by step. X^l is our token embedding and Y^l is an intermediary variable used to generate the next token embedding X^(l+1). In the equation, ESA is Efficient Self-Attention, LN is the layer normalization function (here always Root Mean Square Norm, RMSNorm), and finally there is SwiGLU. SwiGLU is defined as follows:

Here swish = x*sigmoid(Wg * x), where Wg is a trainable parameter. We then find the element-wise product (Hadamard product) between that result and X*W1 before multiplying that whole product by W2. The goal with SwiGLU is to get an activation function that conditionally passes different amounts of information through the layer to the next token.
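A compact sketch of that gating in code (a minimal module with illustrative dimensions, not the YOCO implementation):

import torch
import torch.nn as nn

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.Wg = nn.Linear(d_model, d_hidden, bias=False)   # gate projection
        self.W1 = nn.Linear(d_model, d_hidden, bias=False)   # value projection
        self.W2 = nn.Linear(d_hidden, d_model, bias=False)   # output projection

    def forward(self, x):
        g = self.Wg(x)
        swish = g * torch.sigmoid(g)          # swish = (Wg x) * sigmoid(Wg x)
        return self.W2(swish * self.W1(x))    # Hadamard product with x W1, then project back with W2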

Now that we see how the self-decoder works, let's go into the two ways the authors considered implementing ESA.

First, they considered what is called Gated Retention. Retention and self-attention are admittedly very similar, with the authors of the Retentive Network: A Successor to Transformer for Large Language Models paper saying that the key difference lies in the activation function: retention removes the softmax, allowing for a recurrent formulation. They use this recurrent formulation, along with its parallelizability, to drive memory efficiencies.

To dive into the mathematical details:

We have our typical matrices of Q, K, and V, each of which is multiplied by the learnable weights associated with that matrix. We then find the Hadamard product between the weighted matrices and a scalar decay term. The goal of this decay term is to create exponential decay, while the D matrix helps with causal masking (stopping future tokens from interacting with current tokens) and activation.
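A sketch of how such a decay-plus-causal-mask matrix can be built (gamma here stands in for the decay scalar and is an assumption of this sketch):

import torch

def decay_mask(seq_len, gamma):
    # D[n, m] = gamma**(n - m) for m <= n, else 0: causal masking combined with exponential decay
    n = torch.arange(seq_len).unsqueeze(1)
    m = torch.arange(seq_len).unsqueeze(0)
    exponent = (n - m).clamp(min=0).float()
    return torch.where(n >= m, gamma ** exponent, torch.zeros(()))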

Gated Retention is distinct from retention in how this decay value is obtained: here the matrix W is used to make it data-driven, allowing our ESA to adapt to the input.

Sliding Window ESA introduces the idea of limiting how many tokens the attention window covers. While in regular self-attention all previous tokens are attended to in some way (even if their value is 0), in sliding window ESA we choose some constant value C that limits the size of these matrices. This means that during inference the KV cache can be reduced to a constant complexity.
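A minimal sketch of that constant-size cache idea (function and variable names are assumptions, not the paper's code):

import torch

def sliding_window_cache(k_cache, v_cache, k_new, v_new, window):
    # Keep only the most recent `window` keys/values so the cache stays constant-size at inference time.
    k_cache = torch.cat([k_cache, k_new], dim=0)[-window:]
    v_cache = torch.cat([v_cache, v_new], dim=0)[-window:]
    return k_cache, v_cache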

To again dive into the math:

We have our matrices being scaled by their corresponding weights. Next, we compute the head similarly to how multi-head attention is computed, where B acts both as a causal map and as a way to make sure only the most recent C tokens are attended to.

Read more here:

Understanding You Only Cache Once | by Matthew Gunton | Jun, 2024 - Towards Data Science

Neo4j Announces Collaboration with Snowflake for Advanced AI Insights & Predictive Analytics USA – English – PR Newswire

Neo4j knowledge graphs, graph algorithms, and ML tools are fully integrated within Snowflake - with zero ETL & requiring no specialist graph expertise

SAN FRANCISCO, June 4, 2024 /PRNewswire/ -- Graph database and analytics leader Neo4j today announced at Snowflake's annual user conference, Snowflake Data Cloud Summit 2024, a partnership with Snowflake to bring its fully integrated native graph data science solution within the Snowflake AI Data Cloud. The integration enables users to instantly execute more than 65 graph algorithms, eliminates the need to move data out of their Snowflake environment, and empowers them to leverage advanced graph capabilities using the SQL programming language, environment, and tooling that they already know.

The offering removes complexity, management hurdles, and learning curves for customers seeking graph-enabled insights crucial for AI/ML, predictive analytics, and GenAI applications. The solution features the industry's most extensive library of graph algorithms to identify anomalies and detect fraud, optimize supply chain routes, unify data records, improve customer service, power recommendation engines, and hundreds of other use cases. Anyone who uses Snowflake SQL can get more projects into production faster, accelerate time-to-value, and generate more accurate business insights for better decision-making.

Neo4j graph data science is an analytics and machine learning (ML) solution that identifies and analyzes hidden relationships across billions of data points to improve predictions and discover new insights. Neo4j's library of graph algorithms and ML modeling enables customers to answer questions like what's important, what's unusual, and what's next. Customers can also build knowledge graphs, which capture relationships between entities, ground LLMs in facts, and enable LLMs to reason, infer, and retrieve relevant information more accurately and effectively. Neo4j graph data science customers include Boston Scientific, Novo Nordisk, OrbitMI, and Zenapse, among many others.

"By 2025, graph technologies will be used in 80% of data and analytics innovations up from 10% in 2021 facilitating rapid decision-making across the enterprise," predicts Gartner in its Emerging Tech Impact Radar: Data and Analytics November 20, 2023 report. Gartner also notes, "Data and analytics leaders must leverage the power of large language models (LLMs) with the robustness of knowledge graphs for fault-tolerant AI applications," in the November 2023 report AI Design Patterns for Knowledge Graphs and Generative AI.

Neo4j with Snowflake: new offering capabilities and benefits

Enterprises can harness and scale their secure, governed data natively in Snowflake and augment it with Neo4j's graph analytics and reasoning capabilities for more efficient and timely decision-making, saving customers time and resources.

Supporting quotes

Greg Steck, VP Consumer Analytics, Texas Capital Bank

"At Texas Capital Bank, we're built to help businesses and their leaders succeed. We use Snowflake and Neo4j for critical customer 360 and fraud use cases where relationships matter. We are excited about the potential of this new partnership. The ability to use Neo4j graph data science capabilities within Snowflake will accelerate our data applications and further enhance our ability to bring our customers long-term success."

Jeff Hollan, Head of Applications and Developer Platform, Snowflake

"Integrating Neo4j's proven graph data science capabilities with the Snowflake AI Data Cloud marks a monumental opportunity for our joint customers to optimize their operations. Together, we're equipping organizations with the tools to extract deeper insights, drive innovation at an unprecedented pace, and set a new standard for intelligent decision-making."

Sudhir Hasbe, Chief Product Officer, Neo4j

"Neo4j's leading graph analytics combined with Snowflake's unmatched scalability and performance redefines how customers extract insights from connected data while meeting users in the SQL interfaces where they are today. Our native Snowflake integration empowers users to effortlessly harness the full potential of AI/ML, predictive analytics, and Generative AI for unparalleled insights and decision-making agility."

The new capabilities are available for preview and early access, with general availability later this year on Snowflake Marketplace. For more information, read our blog post or contact us for a preview of Neo4j on Snowflake AI Data Cloud.

To learn more about how organizations are building next-gen apps on Snowflake, click here.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

About Neo4j

Neo4j, the Graph Database & Analytics leader, helps organizations find hidden patterns and relationships across billions of data connections deeply, easily, and quickly. Customers leverage the structure of their connected data to reveal new ways of solving their most pressing business problems, including fraud detection, customer 360, knowledge graphs, supply chain, personalization, IoT, network management, and more, even as their data grows. Neo4j's full graph stack delivers powerful native graph storage with native vector search capability, data science, advanced analytics, and visualization, with enterprise-grade security controls, scalable architecture, and ACID compliance. Neo4j's dynamic open-source community brings together over 250,000 developers, data scientists, and architects across hundreds of Fortune 500 companies, government agencies, and NGOs. Visit neo4j.com.

Contact: [emailprotected] neo4j.com/pr

© 2024 Neo4j, Inc. Neo Technology, Neo4j, Cypher, Neo4j Bloom, Neo4j Graph Data Science Library, Neo4j Aura, and Neo4j AuraDB are registered trademarks or trademarks of Neo4j, Inc. All other marks are owned by their respective companies.

SOURCE Neo4j

Here is the original post:

Neo4j Announces Collaboration with Snowflake for Advanced AI Insights & Predictive Analytics USA - English - PR Newswire

Effective Strategies for Managing ML Initiatives | by Anna Via | Jun, 2024 – Towards Data Science

Embracing uncertainty, the right people, and learning from the data

Picture by Cottonbro, on Pexels

This blog post is an updated version of part of a conference talk I gave at GOTO Amsterdam last year. The talk is also available to watch online.

Providing value and positive impact through machine learning product initiatives is not an easy job. One of the main reasons for this complexity is the fact that, in ML initiatives developed for digital products, two sources of uncertainty intersect. On one hand, there is the uncertainty related to the ML solution itself (will we be able to predict what we need to predict with good enough quality?). On the other hand, there is the uncertainty related to the impact the whole system will be able to provide (will users like this new functionality? will it really help solve the problem we are trying to solve?).

All this uncertainty means failure in ML product initiatives is something relatively frequent. Still, there are strategies to manage and improve the probabilities of success (or at least to survive through them with dignity!). Starting ML initiatives on the right foot is key. I discussed my top learnings in that area in a previous post: start with the problem (and define how predictions will be used from the beginning), start small (and maintain small if you can), and prioritize the right data (quality, volume, history).

However, starting a project is just the beginning. The challenge to successfully manage an ML initiative and provide a positive impact continues throughout the whole project lifecycle. In this post, I'll share my top three learnings on how to survive and thrive during ML initiatives:

It is really hard (impossible even!) to plan ML initiatives beforehand and to develop them according to that initial plan.

The most popular project plan for ML initiatives is the ML Lifecycle, which splits the phases of an ML project into business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Although these phases are drawn as consecutive steps, in many representations of this lifecycle youll find arrows pointing backward: at any point in the project, you might learn something that forces you to go back to a previous phase.

This translates into projects where it is really hard to know when they will finish. For example, during the evaluation step, you might realize, thanks to model explainability techniques, that a specific feature wasn't well encoded, and this forces you to go back to the data preparation phase. It could also happen that the model isn't able to predict with the quality you need, which might force you to go back to the beginning, to the business understanding phase, to redefine the project and business logic.

Whatever your role in an ML initiative or project is, it is key to acknowledge things won't go according to plan, to embrace all this uncertainty from the beginning, and to use it to your advantage. This is important both for managing stakeholders (expectations, trust) and for yourself and the rest of the team (motivation, frustration). How?

Any project starts with people. The right combination of people, skills, perspectives, and a network that empowers you.

The days when Machine Learning (ML) models were confined to the Data Scientist's laptop are over. Today, the true potential of ML is realised when models are deployed and integrated into the company's processes. This means more people and skills need to collaborate to make that possible (Data Scientists, Machine Learning Engineers, Backend Developers, Data Engineers).

The first step is identifying the skills and roles that are required to successfully build the end-to-end ML solution. However, more than a group of roles covering a list of skills is required. Having a diverse team that can bring different perspectives and empathize with different user segments has proven to help teams improve their ways of working and build better solutions (why having a diverse team will make your products better).

People don't talk about this enough, but the key people to deliver a project go beyond the team itself. I refer to these other people as the network. The network is the people you know are really good at specific things, whom you trust to ask for help and advice when needed, and who can unblock, accelerate, or empower you and the team. The network can be your business stakeholders, manager, staff engineers, user researchers, data scientists from other teams, customer support team… Ensure you build your own network and identify the ally you can go to depending on each specific situation or need.

A project is a continuous learning opportunity, and many times learnings and insights come from checking the right data and monitors.

In ML initiatives there are three big groups of metrics and measures that can bring a lot of value in terms of learnings and insights: model performance monitoring, service performance, and final impact monitoring. In a previous post I dive deeper into this topic.

Checking the right data and monitors while developing or deploying ML solutions is key to:

Effectively managing ML initiatives from beginning to end is a complex task with multiple dimensions. In this blog post I shared, based on my experience first as a Data Scientist and later as an ML Product Manager, the factors I consider key when dealing with an ML project: embracing uncertainty, surrounding yourself with the right people, and learning from the data.

I hope these insights help you successfully manage your ML initiatives and drive positive impact through them. Stay tuned for more posts about the intersection of Machine Learning and Product Management 🙂

Read the original here:

Effective Strategies for Managing ML Initiatives | by Anna Via | Jun, 2024 - Towards Data Science

Linear Attention Is All You Need. Self-attention at a fraction of the | by Sam Maddrell-Mander | Jun, 2024 – Towards Data Science

This is the kind of thing anyone who's spent much time working with transformers and self-attention will have heard a hundred times. It's absolutely true; we've all experienced it: as you try to increase the context size of your model, everything suddenly comes to a grinding halt. But at the same time, virtually every week, it seems, there's a new state-of-the-art model with a new record-breaking context length. (Gemini has a context length of 2M tokens!)

There are lots of sophisticated methods like RingAttention that make training incredibly long context lengths in large distributed systems possible, but what I'm interested in today is a simpler question.

How far can we get with linear attention alone?

This will be a bit of a whistle-stop tour, but bear with me as we touch on a few key points before digging into the results.

We can basically summarise the traditional attention mechanism with two key points:

This is expressed in the traditional form as:
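The traditional form referred to here is the standard softmax attention:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V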

It turns out that if we ask our mathematician friends, we can think about this slightly differently. The softmax can be thought of as one of many ways of describing the probability distribution relating tokens with each other. We can use any similarity measure we like (the dot product being one of the simplest), and so long as we normalise it, we're fine.

It's a little sloppy to say this is attention; in fact, it's only the attention we know and love when the similarity function is the exponential of the dot product of queries and keys (given below), as we find in the softmax. But this is where it gets interesting: instead of using this expression, what if we could approximate it?

We can assume there is some feature map phi which gives us a result nearly the same as taking the exponential of the dot product. And crucially, writing the expression like this allows us to play with the order of matrix multiplication operations.

In the paper they propose the Exponential Linear Unit (ELU) as the feature map due to a number of useful properties:

We won't spend too much more time on this here, but this is pretty well empirically verified as a fair approximation to the softmax function.

What this allows us to do is change the order of operations. We can take the product of our feature map of K with V first to make a KV block, then take the product with Q. The quadratic term is now over the model dimension rather than the sequence length.

Putting this all together into the linear attention expression gives us:
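In the usual linear-attention formulation (a reconstruction with the feature map phi from above, not the article's exact notation), row i of the output is:

V'_{i} = \frac{\phi(Q_{i})^{\top} \left( \sum_{j} \phi(K_{j}) V_{j}^{\top} \right)}{\phi(Q_{i})^{\top} \left( \sum_{j} \phi(K_{j}) \right)}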

Where we only need to compute the terms in the brackets once per query row.

(If you want to dig into how the causal masking fits into this and how the gradients are calculated, take a look at the paper. Or watch this space for a future blog.)

The mathematical case is strong, but personally, until I've seen some benchmarks, I'm always a bit suspicious.

Let's start by looking at the snippets of the code to describe each of these terms. The softmax attention will look very familiar; we're not doing anything fancy here.
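A minimal stand-in for such a softmax attention snippet (tensor shapes and names are illustrative assumptions):

import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k); quadratic in seq_len via the (seq_len x seq_len) score matrix
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v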

Then for the linear attention we start by getting the Query, Key and Value matrices, then apply the ELU(x) feature mapping to the Query and Keys. Then we use einsum notation to perform the multiplications.
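A corresponding sketch of the linear version, assuming the ELU(x) + 1 feature map from the linear transformer paper (names and shapes are again assumptions):

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, d_k); phi(x) = ELU(x) + 1 keeps the feature map positive
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum('bnd,bne->bde', k, v)                            # (d_k x d_k) summary, computed once
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)      # per-row normalisation
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)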

Seeing this written in code is all well and good, but what does it actually mean experimentally? How much of a performance boost are we talking about here? It can be hard to appreciate the degree of speed-up going from a quadratic to a linear bottleneck, so I've run the following experiment.

We're going to take a single attention layer, with a fixed d_k model dimension of 64, and benchmark the time taken for a forward pass of a batch of 32 sequences. The only variable to change will be the sequence length, spanning 128 up to 6000 (for reference, the GPT-3 context length is 2048). Each run is done 100 times to get a mean and standard deviation, and experiments are run using an Nvidia T4 GPU.

For such a simple experiment the results are pretty striking.

The results show for even an incredibly small toy example that we get a speed up of up to 60x.

There are a few obvious take-aways here:

For completeness, also do not mistake this as saying linear attention is 60x faster for small models. In reality, the feed-forward layers are often a bigger chunk of the parameters in a Transformer, and the encoding/decoding is often a size-limiting component as well. But in this tightly defined problem, it is pretty impressive!

Continue reading here:

Linear Attention Is All You Need. Self-attention at a fraction of the | by Sam Maddrell-Mander | Jun, 2024 - Towards Data Science

10 Essential DevOps Tools Every Beginner Should Learn – KDnuggets

DevOps (Development Operations) and MLOps (Machine Learning Operations) are almost the same and share a wide variety of tools. As a DevOps engineer, you will deploy, maintain, and monitor applications, whereas as an MLOps engineer, you deploy, manage, and monitor machine learning models in production. So, it is beneficial to learn about DevOps tools, as it opens a wide array of job opportunities for you. DevOps refers to a set of practices and tools designed to increase a company's ability to deliver applications and services faster and more efficiently than traditional software development processes.

In this blog, you will learn about essential and popular tools for versioning, CI/CD, testing, automation, containerization, workflow orchestration, cloud, IT management, and monitoring applications in production.

Git is the backbone of modern software development. It is a distributed version control tool that allows multiple developers to work on the same codebase without interfering with each other. Understanding Git is fundamental if you are getting started with software development.

Learn about 14 Essential Git Commands for versioning and collaborating on data science projects.

GitHub Actions simplifies the automation of your software workflows, enabling you to build, test, and deploy your code directly from GitHub with just a few lines of code. As a core function of DevOps engineering, mastering Continuous Integration and Continuous Delivery (CI/CD) is crucial for success in the field. By learning to automate workflows, generate logs, and troubleshoot issues, you will significantly enhance your job prospects.

Remember, it is all about experience and portfolio in operations-related careers.

Learn how to automate machine learning training and evaluation by following GitHub Actions For Machine Learning Beginners.

Selenium is a powerful tool primarily used for automating web browser interactions, allowing you to efficiently test your web application. With just a few lines of code, you can harness the power of Selenium to control a web browser, simulate user interactions, and perform automated testing on your web application, ensuring its functionality, reliability, and performance.

Since many servers use Linux, understanding this operating system can be crucial. Linux commands and scripts form the foundation of many operations in the DevOps world, from basic file manipulation to automating the entire workflow. In fact, many seasoned developers rely heavily on Linux scripting, particularly Bash, to develop custom solutions for data loading, manipulation, automation, logging, and numerous other tasks.

Learn about the most commonly used Linux commands by checking out the Linux for Data Science cheat sheet.

Familiarity with Cloud Platforms like AWS, Azure, or Google Cloud Platform is essential for landing a job in the industry. The majority of services and applications that we use every day are deployed on the Cloud.

Cloud platforms offer services that can help you deploy, manage, and scale applications. By gaining expertise in Cloud platforms, you'll be able to harness the power of scalability, flexibility, and cost-effectiveness, making you a highly sought-after professional in the job market.

Start the Beginner's Guide to Cloud Computing and learn how cloud computing works, top cloud platforms, and applications.

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.

Learn more about Docker by following the Docker Tutorial for Data Scientists.

Kubernetes is a powerful container orchestration tool that automates the deployment, scaling, and management of containers across diverse environments. As a DevOps engineer, mastering Kubernetes is essential to efficiently scale, distribute, and manage containerized applications, ensuring high availability, reliability, and performance.

Read Kubernetes In Action: Second Edition book to learn about the essential tool for anyone deploying and managing cloud-native applications.

Prometheus is an open-source monitoring and alerting toolkit originally built at SoundCloud. It enables you to monitor a wide range of metrics and receive alerts in real time, providing unparalleled insights into your system's performance and health. By learning Prometheus, you will be able to identify issues quickly, optimize system efficiency, and ensure high uptime and availability.

Terraform, an open-source infrastructure as code (IaC) tool developed by HashiCorp, enables you to provision, manage, and version infrastructure resources across multiple cloud and on-premises environments with ease and precision. It supports a wide range of existing service providers, as well as custom in-house solutions, allowing you to create, modify, and track infrastructure changes safely, efficiently, and consistently.

Ansible is a simple, yet powerful, IT automation engine that streamlines provisioning, configuration management, application deployment, orchestration, and a multitude of other IT processes. By automating repetitive tasks, deploying applications, and managing configurations across diverse environments - including cloud, on-premises, and hybrid infrastructures - Ansible empowers users to increase efficiency, reduce errors, and improve overall IT agility.

Learning about these tools is just the starting point for your journey in the world of DevOps. Remember, DevOps is about more than just tools; it is about creating a culture that values collaboration, continuous improvement, and innovation. By mastering these tools, you will build a solid foundation for a successful career in DevOps. So, begin your journey today and take the first step towards a highly paid and exciting career.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

Read the original here:

10 Essential DevOps Tools Every Beginner Should Learn - KDnuggets

Bit-LoRA as an application of BitNet and 1.58 bit neural network technologies – Towards Data Science

For the experiments I took my proprietary data for a text generation task. The data itself is not that important here; I would just say that it is a small subset of the instructions dataset used to train instruction-following LLMs. As the base model I decided to use the microsoft/Phi-3-mini-4k-instruct model. I did 3 epochs of LoRA adapter tuning with fp16 mixed-precision training using the Huggingface Trainer and measured the loss on evaluation. After that I implemented BitNet (replacing the linear layers in the LoRA adapters) and 1.58-bit LoRA training and reported the results. I used 4-bit base model quantization with BitsAndBytes during the training in the Q-LoRA configuration.

The following LoRA hyperparameters were used: rank = 32, alpha = 16, dropout = 0.05.

3.1. Classic LoRA training

For all LoRA experiments the QLoRA approach was used for the base model quantization with NF4, applying LoRA to all the linear layers of the base model. The optimizer is Paged AdamW with warmup and cosine annealing down to 90% of the maximum learning rate. The maximum learning rate equals 2e-4. The train/test split was random; the test set is 10% of the whole dataset.

3.2. LoRA BitNet implementation

For BitNet LoRA training, the approach from BitNet: Scaling 1-bit Transformers for Large Language Models was used along with the code for its implementation. According to the BitNet paper, the weights of the LoRA layers were binarized with scaling:

At the same time, activations should also be quantized according to the paper:

According to the formulas provided, you can see that each parameter is transformed with the sign function to be either +1 or -1; those parameters are multiplied by the quantized and normalized input X and scaled with the mean absolute value of the layer's parameters. Code implementation:

All the code above is from the https://github.com/kyegomez/BitNet GitHub repository.
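Rather than reproducing the repository code, here is a rough sketch of the same idea: a BitNet-style linear layer with binarized weights, activation normalization, and 8-bit activation quantization (the details are an approximation, not the exact repository implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    def forward(self, x):
        w = self.weight
        beta = w.abs().mean()                         # scaling factor: mean absolute value of the weights
        w_bin = torch.sign(w - w.mean())              # binarize to +1 / -1 around the mean
        w_q = w + (beta * w_bin - w).detach()         # straight-through estimator: quantized forward, full-precision gradients

        x = F.layer_norm(x, x.shape[-1:])             # LN(x) before activation quantization
        gamma = x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
        x_q = x + ((x * 127.0 / gamma).round().clamp(-128, 127) * gamma / 127.0 - x).detach()  # 8-bit absmax quantization

        return F.linear(x_q, w_q, self.bias)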

After LoRA training, the adapter weights can be merged with the base model, because each LoRA adapter is just a pair of linear layers without biases and non-linear activations. Normalization of activations (LN(x)) and their quantization make merging the LoRA adapters more difficult (after the merger, the LoRA adapter shares the same inputs to the linear layer as the base model, and those layers work with activations without any additional modifications). That is why an additional experiment without normalization and activation quantization was conducted, and it led to better performance. To make such modifications we simply modify the forward method of the BitLinear class:
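A sketch of what that modification could look like, replacing the forward method of the BitLinear sketch above so activations pass through untouched while weight binarization stays the same:

    def forward(self, x):
        # No LN(x) and no activation quantization: the adapter sees exactly what the base layer sees.
        w = self.weight
        beta = w.abs().mean()
        w_q = w + (beta * torch.sign(w - w.mean()) - w).detach()
        return F.linear(x, w_q, self.bias)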

The presented code is quantization-aware training, because the master weights of each BitLinear layer are still in high precision, while we binarize the weights during the forward pass (we can do the same during model inference). The only issue here is that we additionally have a scale parameter that is individual to each layer and has high precision.

After we get BitLinear layers, we need to replace the linear layers in the LoRA adapter with these new linear layers to apply the BitLinear modification to classic LoRA. To do so, we can rewrite the update_layer method of the LoraLayer class (peft.tuners.lora.layer.LoraLayer) with the same method but with BitLinear layers instead of Linear:

After we create such a class we can replace the update_layer method of the original LoraLayer with the new one:
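As an illustrative alternative to patching update_layer, a post-hoc swap of the already-created LoRA A/B linear layers for BitLinear achieves a similar effect (a sketch assuming the BitLinear class above, not the exact approach from the post):

import torch.nn as nn
from peft.tuners.lora.layer import LoraLayer

def bitify_lora_layers(model):
    # Walk the PEFT model and replace each LoRA A/B nn.Linear with a BitLinear, reusing the initialized weights.
    for module in model.modules():
        if isinstance(module, LoraLayer):
            for adapter_dict in (module.lora_A, module.lora_B):
                for name, lin in adapter_dict.items():
                    bit = BitLinear(lin.in_features, lin.out_features, bias=lin.bias is not None)
                    bit.weight = lin.weight
                    if lin.bias is not None:
                        bit.bias = lin.bias
                    adapter_dict[name] = bit
    return model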

3.3. 1.58 bit LoRA

For this experiment the approach from The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits was used. The conceptual difference is that instead of binarizing to +1 and -1, the authors of this paper propose quantizing weights to -1, 0 and +1 for better accuracy.

The authors excluded activation scaling from the pipeline, which had been creating extra difficulties for merging with the base model in our experiments. In our experiments we additionally removed activation quantization from the pipeline to make the LoRA adapter merger simpler.

To tune the LoRA adapters with this approach, we should simply update the weight_quant function with the following:
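A sketch of such a ternary weight_quant, following the absmean scheme from the 1.58-bit paper (the straight-through trick and the names are assumptions of this sketch):

import torch

def weight_quant_ternary(w, eps=1e-5):
    # Scale by the mean absolute value, then round to {-1, 0, +1} and dequantize.
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    return w + (w_q - w).detach()   # quantized forward pass, full-precision gradients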

For the 1.58 bit implementation I used Binary Magic: Building BitNet 1.58bit Using PyTorch from Scratch publication as the starting point.

View original post here:

Bit-LoRA as an application of BitNet and 1.58 bit neural network technologies - Towards Data Science

Going global as Mason Korea’s first computational and data sciences graduate – George Mason University

Traveling abroad has been part of Jimin Jeon's life for as long as she can remember. She traveled with her mom during every school vacation, which allowed her to visit 23 countries by the time she was a college student. Being exposed to different cultures from a young age helped her develop a desire to pursue her college education abroad. That brought her to Mason Korea after 12 years of Korean public school education.

"While the thought of studying abroad was exciting, I felt burdened by the language barrier to study abroad in the U.S. right after graduating high school," Jeon said. "Mason Korea was an alternative to ease that transition by improving my English skills in a more familiar setting in South Korea."

Jeon was part of the first cohort in the newly established Computational and Data Sciences Department at Mason Korea. Although her frequent travels around the world prompted her to major in global affairs, she had her mind set on the world of big data since high school. Thus, after her freshman year, once the new major was opened, she made the jump to the STEM field.

Jeon found she had direct opportunities to engage in data analysis. Her favorite part of Mason Korea was the Career Development Center, which allowed students like her to be exposed to opportunities in data analytics to gain technical hands-on experiences. Her first work experience was through the center as a data science intern at a real estate AI valuation startup during her junior year.

"It was a special opportunity to see how the knowledge about programming languages I acquired in the classroom can be applied in the real workforce and to identify the areas that I need to continue to improve to be a more competent data scientist," said Jeon.

Transitioning to the Fairfax Campus in the fall semester of 2023, Jeon stayed true to her goal of diversifying her experiences. Her last semester at George Mason included working as a teaching assistant for the Computational and Data Sciences Department in the College of Science, performing data cleaning for an on-campus project, and helping students practice their Korean through the language exchange program. She took advantage of the language environment so that she could build her English skills.

Jeon is now a proud graduate in computational and data sciences, one of the few who enrolled in the major in 2020. She is excited about the job opportunities she has and wants to encourage all those who have just closed their four-year journey.

"For students just like myself, who have spent their whole life in the Korean education system, going to Mason Korea alone is a challenge," she said. "Learning about various topics at a more sophisticated level in a language that you are not familiar with was also not an easy task for me. Yet, the four-year voyage of diverse experiences and success itself shows that I can take on any challenge at any point in my life."

See the rest here:

Going global as Mason Korea's first computational and data sciences graduate - George Mason University

Bolstering environmental data science with equity-centered approaches – EurekAlert

Graphical abstract (Credit: Joe F. Bozeman III)

A paradigm shift towards integrating socioecological equity into environmental data science and machine learning (ML) is advocated in a new perspective article (DOI: 10.1007/s11783-024-1825-2) published in the Frontiers of Environmental Science & Engineering. Authored by Joe F. Bozeman III from the Georgia Institute of Technology, the paper emphasizes the importance of understanding and addressing socioecological inequity to enhance the integrity of environmental data science.

This study introduces and validates the Systemic Equity Framework and the Wells-Du Bois Protocol, essential tools for integrating equity in environmental data science and machine learning. These methodologies extend beyond traditional approaches by emphasizing socioecological impacts alongside technical accuracy. The Systemic Equity Framework focuses on the concurrent consideration of distributive, procedural, and recognitional equity, ensuring fair benefits for all communities, particularly the marginalized. It encourages researchers to embed equity throughout the project lifecycle, from inception to implementation. The Wells-Du Bois Protocol offers a structured method to assess and mitigate biases in datasets and algorithms, guiding researchers to critically evaluate potential societal bias reinforcement in their work, which could lead to skewed outcomes.

Highlights

Socioecological inequity must be understood to improve environmental data science.

The Systemic Equity Framework and Wells-Du Bois Protocol mitigate inequity.

Addressing irreproducibility in machine learning is vital for bolstering integrity.

Future directions include policy enforcement and systematic programming.

"Our work is not just about improving technology but ensuring it serves everyone justly," said Joe F. Bozeman III, lead researcher and professor at Georgia Institute of Technology. "Incorporating an equity lens into environmental data science is crucial for the integrity and relevance of our research in real-world settings."

This pioneering research not only highlights existing challenges in environmental data science and machine learning but also offers practical solutions to overcome them. It sets a new standard for conducting research that is just, equitable, and inclusive, thereby paving the way for more responsible and impactful environmental science practices.

Journal: Frontiers of Environmental Science & Engineering

Method of Research: Experimental study

Subject of Research: Not applicable

Article Title: Bolstering integrity in environmental data science and machine learning requires understanding socioecological inequity

Article Publication Date: 8-Feb-2024

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.

See the original post here:

Bolstering environmental data science with equity-centered approaches - EurekAlert

Aristotle, AI and the Data Scientist | The ILR School – Cornell University | ILR School

Nearly two and a half millennia since his time, Aristotle and his "virtue ethics" (in short, to live a life of good character) are every bit as relevant to budding statisticians as the technical skills they learn to build AI models, according to Elizabeth Karns, senior lecturer of statistics and data science at Cornell Bowers CIS and at the ILR School.

An epidemiologist and lawyer, Karns launched Integrated Ethics in Data Science (STSCI 3600), a seven-week course offered twice each spring semester, several years ago in response to what she viewed as a disconnect between statisticians and the high-powered, high-consequence statistical models they were being asked to build.

"I started thinking more about algorithms and how we are not preparing students sufficiently to be confronted with workplace pressures to just get the model done: 'Put in the data, don't question it, and just use it,'" she said.

The problem, as she sees it, is that these models are largely unregulated, have no governing body, and thus skirt rigorous scientific testing and evaluation. Lacking such oversight, ethics and fairness, then, become a matter of discretion on the part of the statisticians developing the models; personal values and virtues are brought into the equation, and this is where Aristotle's wisdom proves vital, she said.

"At this point in our lack of regulation, we need to depend on ethical people," Karns said. "I want students to learn to pause and reflect before making decisions, and to ask, 'How well does this align with my values? Is this a situation that could lead to problems for the company or users? Is this something I want to be associated with?' That's the core of the class."

For the course, Karns, with the help of Cornell's Center for Teaching Innovation (CTI), developed an immersive video, "Nobody's Fault: An Interactive Experience in Data Science Practice," which challenges students to consider a moral conflict brought about by a bad model.

"I tell my students that we're going to be in situations in this class where there's not a clear right or wrong answer," she said. "And that's the point: to struggle with that ambiguity now and get some comfort in that gray space. That way, when they get out into the workplace, they can be more effective."

To read more about the work Bowers CIS is doing to develop responsible AI, click here. Louis DiPietro is a public relations and content specialist for Cornell Bowers CIS.

More here:

Aristotle, AI and the Data Scientist | The ILR School - Cornell University | ILR School