
Qlik meets user needs with realistic approach to generative AI – TechTarget

While some vendors fed the hype, Qlik took a pragmatic approach to generative AI.

The business analytics vendor considered its existing capabilities when generative AI surged in popularity following OpenAI's release of ChatGPT in late 2022. It considered its customers' needs. And it considered what it needed to add to meet those customers' needs.

From there, Qlik came up with a realistic strategy, one with trustworthy data at its core and one that its users believe in.

"I always check on what other vendors are doing," said Angel Monjars, Qlik platform manager at C40 Cities, a network of nearly 100 cities working together to combat climate change. "I have to stay in touch with everything that's out there. I'm confident that Qlik is on the right track."

At this point, generative AI is nothing new. But after the launch of ChatGPT, it suddenly embodied the technology that could finally enable true natural language processing. It was the technology that, when combined with enterprise data, could reduce or even eliminate coding and enable anyone within an organization to work with data, rather than only a small percentage of specialists.

Within months, data management and analytics vendors such as Microsoft, ThoughtSpot, Tableau, Alteryx and Informatica were among the many to unveil plans to augment their platforms with generative AI, introducing tools that would make their platforms smarter and simpler. But the tools were under development rather than nearing general availability.

Some of those tools eventually made their way through the preview process. For example, Microsoft first unveiled its Copilot for Power BI in May 2023, but didn't make it generally available until one year later. Other generative AI systems, however, after more than a year, are still not generally available.

Qlik, conversely, didn't quickly grab attention when generative AI became the rage. It didn't publicize every time it came up with an idea. It didn't introduce tools in development that promised to eliminate the difficulties that have existed for decades that make data management and analytics specialized skills.

It didn't buy into the hype surrounding generative AI.

"Over the last year, there were a whole heck of a lot of people making a lot of noise about [generative AI]. We did a bit of the opposite," said Nick Magnuson, Qlik's head of AI. "We took a step back and asked some key questions about how we wanted to plan an ecosystem."

Qlik might have lost out on some publicity in the process. But according to Susan Dean, director of business technology at heavy equipment manufacturer Takeuchi U.S., the ecosystem Qlik is developing serves customers' needs. And what it is now revealing related to generative AI is accelerating quickly.

"Definitely," she said when asked whether Qlik is proving the necessary tools for generative AI development, in an interview earlier this week during this year's Qlik user conference. "I'm very excited to see what's next. They just keep getting better. The leaps and bounds from last year's Qlik [conference] to this year's is night and day."

In a sense, Qlik has always been pragmatic. It's part of why the vendor is still relevant 31 years after it was founded, while onetime competitors such as Business Objects, Cognos and Information Builders have been swallowed up by other vendors and essentially disappeared.

Based in King of Prussia, Pa., Qlik is a longtime analytics vendor that has evolved as business intelligence has evolved.

When data was kept on premises and analytics was a specialized skill for experts only, Qlik provided a platform to meet the needs of data analysts. When Tableau rose to prominence touting self-service analytics, Qlik adapted and developed self-service tools to meet the needs of business users.

When cloud computing emerged and enterprises migrated their data operations to the various clouds, Qlik complemented its on-premises capabilities with a cloud-based version of its platform. When that was no longer enough, Qlik identified data integration as an opportunity for growth and over the past six years has methodically built up a data integration suite to complement its analytics capabilities.

Now, the vendor is taking that same strategic approach to AI as it creates an environment for customers to develop AI models and applications and apply generative AI to existing data products.

"Despite what everyone else is doing, what matters most is our customer needs," Magnuson said. "That's where we're focused."

To meet customer need for a trusted foundation for generative AI, Qlik started by combining its existing AI and machine learning capabilities in a single environment it calls Staige.

Unveiled in September, Staige includes AutoML, which is a tool that enables users to perform predictive analysis, and Insight Advisor, a natural language interface that lets customers query and analyze structured data and provides natural language responses with accompanying visuals.

In addition, Staige provides automated code generation capabilities, integrations with generative AI capabilities from third-party providers such as OpenAI and Hugging Face, and an advisory council to provide guidance for customers getting started with AI.

While Qlik's existing capabilities combined with third-party integrations was a start, Qlik needed more capabilities to effectively provide a foundation for developing trusted AI models and applications.

One thing missing was support for unstructured data, such as text and audio files, which is estimated to now make up more than 80% of all data.

To add support for unstructured data, Qlik acquired Kyndi in January and on June 4 unveiled Qlik Answers. The tool, scheduled for general availability this summer, uses retrieval-augmented generation to enable customers to query and analyze unstructured data with natural language in the same way Insight Advisor enables natural language interactions with structured data.

Furthermore, Qlik Answers provides data lineage information so that users can trace the data used to inform the tool's responses and ensure that those responses can be trusted.

Also missing was the data management component -- the integration layer that would enable customers to build applications using quality data from the start rather than look back later to see if the data they already used could be trusted.

Therefore, to complement Answers, the vendor on June 4 unveiled Qlik Talend Cloud, which is likewise definitively scheduled for general availability this summer. The suite, which comes a little more than a year after Qlik completed its acquisition of Talend, is a data integration environment that forms the foundation for ensuring the quality of data used to train generative AI models and applications. Included are governance capabilities and tools such as a trust score.

Combined, Qlik Answers and Qlik Talend Cloud succeed at providing quality data for AI models and applications, according to Mike Leone, an analyst at TechTarget's Enterprise Strategy Group.

"Qlik Answers and Qlik Talend Cloud can work together to deliver a trusted data foundation for AI and fuel innovation from AI," he said.

In addition, the acquisition of Kyndi was critical, Leone continued.

"Kyndi is really that enabling factor for Qlik to extend the delivery of predictive AI and generative AI more broadly and at scale," he said. "I like Qlik's focus on unstructured data as it's often overlooked and underutilized."

Given the foundation that's now been formed by addressing practical needs, customers can begin using Qlik as a foundation for developing generative AI capabilities.

"After we saw what Qlik presented, the possibilities [for using generative AI] are open now," Monjars said.

C40 has been using Insight Advisor and other AI and machine learning tools, but had previously been hesitant to add any generative AI capabilities given its strict data security and data compliance requirements, he continued.

"A very real component we saw is the ability to analyze unstructured data, and there's a lot of knowledge there," Monjars said.

By grounding its generative AI plans in reality rather than making promises it might not be able to keep, Qlik is serving the needs of its customers.

But that pragmatic approach might have come with a cost, according to Donald Farmer, founder and principal of TreeHive Strategy.

Data management rivals Databricks and Snowflake have broadcast seemingly every move while creating environments for AI development. Tech giants AWS, Google and Microsoft have similarly maintained a steady presence in the collective mindset. And many of the more specialized vendors have introduced large swaths of capabilities even when they're only just starting to build them.

Qlik's comparatively quiet approach might have resulted in slow customer growth.

Farmer spent nearly 20 years in product development, including a stint at Qlik as vice president of innovation and design. Now, he heads a consulting firm that works with companies to develop analytics and AI strategies. While the evidence is anecdotal, he noted that Qlik's resonance with potential new customers seems to be slowing.

"Qlik still remains a significant vendor, but with one caveat," Farmer said. "There is very little sign of them gaining traction with greenfield customers. The trickle of new logos is slow. Mostly, they are adding more value to existing clients. But to be fair, they are adding significant value."

Qlik Answers could be a means of adding new users, according to Magnuson.

When Qlik added automated machine learning capabilities with its 2021 acquisition of Big Squid and turned that technology into AutoML, it drew in new customers, he said. Once generally available, Qlik Answers, though tightly integrated with the rest of the Qlik ecosystem, will be available as a standalone tool and could likewise be a way to draw new customers.

"We've made a conscious decision as part of a strategy to offer these solutions to a new buying agenda," Magnuson said. "We know a lot of people are generating new budgets to acquire technology. ... Answers potentially gives us a new opportunity to have a conversation with someone where we can open up a net new opportunity."

Regardless of whether Qlik's practical approach to generative AI brings in a significant number of new customers, what Qlik is doing in terms of technological innovation and support for that technology works for the vendor's existing users, according to Dean.

When Dean joined Takeuchi U.S. in 2018, the company had one analyst keeping its data in Excel spreadsheets. Dean subsequently led the company's transition to Qlik, beginning with a single application. Now, Takeuchi U.S. uses Qlik not only in its administrative office, but also with each of its hundreds of dealers.

But while Takeuchi U.S. -- a subsidiary of Japan-based Takeuchi Manufacturing -- is a sizable organization, it does not boast a big roster of data scientists. Dean is part of a team of three BI analysts.

To do more advanced analytics than just developing dashboards and reports, Takeuchi U.S. needs assistance. One of the main reasons the company has remained with Qlik is the relationship Dean and her team have with the vendor and the support they receive.

"My partnership with Qlik is what keeps me," Dean said. "They work with us."

Takeuchi U.S. now uses AutoML. And once a major undertaking to implement an ERP system is finished next spring, the company wants to build new analytics applications to discover insights related to the performance of its excavators, wheel loaders and other products.

"I'll definitely set up demos with [Qlik] to figure out what will suit us when the time is right," Dean said.

While Qlik Answers and Qlik Talend Cloud in some ways complete the foundation for trusted data that Qlik targeted as its role in enabling generative AI development, the vendor nevertheless plans to develop additional capabilities.

Most notably, it aims to enable customers to query and analyze structured and unstructured data together, according to Magnuson. The acquisition of Kyndi led to Qlik Answers, which enables customers to operationalize unstructured data. But that's just a beginning.

"[Qlik Answers] is starting us on this bigger journey to develop strength and muscle around unstructured content that puts us in a position to provide value to customers by integrating both structured and unstructured data in a single analytics experience," Magnuson said.

Monjars likewise noted that Qlik's enablement of access to unstructured data is significant. From a technological standpoint, Qlik is meeting C40's needs. But where he said he'd like to see more investment from Qlik is in another practical area: increasing awareness.

Qlik provides its own data literacy program. But its customer base is not as big as that of some other platforms such as Power BI, so it is sometimes difficult to find new employees who don't need to be trained to use Qlik, Monjars noted.

"Qlik is doing what we need, but it's a little hard to find people who are Qlik-trained," he said. "A given professional maybe learns Power BI before they learn Qlik, so that affects the availability of people out there. It would be helpful if Qlik were more of a household name and people made it a priority to learn Qlik coming out of school."

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.


Exploring RAG Applications Across Languages: Conversing with the Mishnah – Towards Data Science


I'm excited to share my journey of building a unique Retrieval-Augmented Generation (RAG) application for interacting with rabbinic texts in this post. MishnahBot aims to provide scholars and everyday users with an intuitive way to query and explore the Mishnah interactively. It can help solve problems such as quickly locating relevant source texts or summarizing a complex debate about religious law, extracting the bottom line.

I had the idea for such a project a few years back, but I felt like the technology wasn't ripe yet. Now, with the advancements of large language models and RAG capabilities, it is pretty straightforward.

This is what our final product will look like, which you could try out here:

RAG applications are gaining significant attention for improving accuracy and harnessing the reasoning power available in large language models (LLMs). Imagine being able to chat with your library, a collection of car manuals from the same manufacturer, or your tax documents. You can ask questions and receive answers informed by the wealth of specialized knowledge.

There are two emerging trends in improving language model interactions: Retrieval-Augmented Generation (RAG) and increasing context length, potentially by allowing very long documents as attachments.

One key advantage of RAG systems is cost-efficiency. With RAG, you can handle large contexts without drastically increasing the query cost, which can become expensive. Additionally, RAG is more modular, allowing you to plug and play with different knowledge bases and LLM providers. On the other hand, increasing the context length directly in language models is an exciting development that can enable handling much longer texts in a single interaction.

For this project, I used AWS SageMaker for my development environment, AWS Bedrock to access various LLMs, and the LangChain framework to manage the pipeline. Both AWS services are user-friendly and charge only for the resources used, so I really encourage you to try it out yourselves. For Bedrock, you'll need to request access to Llama 3 70b Instruct and Claude Sonnet.

Let's open a new Jupyter notebook and install the packages we will be using:
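The original install cell is not shown here; below is a plausible set of dependencies for the stack described in this article (LangChain, Bedrock via boto3, ChromaDB, and sentence-transformers). Treat the exact package list as an assumption.

```python
# Hypothetical dependency set for this project; pin versions as needed.
!pip install -U langchain langchain-community langchain-aws boto3 chromadb sentence-transformers pandas
```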

The dataset for this project is the Mishnah, an ancient Rabbinic text central to Jewish tradition. I chose this text because it is close to my heart and also presents a challenge for language models since it is a niche topic. The dataset was obtained from the Sefaria-Export repository, a treasure trove of rabbinic texts with English translations aligned with the original Hebrew. This alignment facilitates switching between languages in different steps of our RAG application.

Note: The same process applied here can be applied to any other collection of texts of your choosing. This example also demonstrates how RAG technology can be utilized across different languages, as shown with Hebrew in this case.

First, we will need to download the relevant data. We will use git sparse-checkout, since the full repository is quite large. Open a terminal window and run the following:
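The original commands are not reproduced here; the snippet below is one way to do a sparse checkout of the Sefaria-Export repository. The `json/Mishnah` subdirectory path is an assumption about the repository layout.

```bash
# Clone without file contents, then fetch only the Mishnah subtree.
git clone --filter=blob:none --no-checkout https://github.com/Sefaria/Sefaria-Export.git
cd Sefaria-Export
git sparse-checkout init --cone
git sparse-checkout set "json/Mishnah"
git checkout master
```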

And voilà! We now have the data files that we need.

Now let's load the documents into our Jupyter notebook environment:
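A sketch of the loading step, assuming the Sefaria-Export layout in which each tractate folder holds `English/merged.json` and `Hebrew/merged.json` files whose `text` field is a list of chapters, each a list of paragraphs; the column names here are my own.

```python
import json
from pathlib import Path

import pandas as pd

records = []
for en_path in Path("Sefaria-Export/json/Mishnah").rglob("English/merged.json"):
    he_path = en_path.parent.parent / "Hebrew" / "merged.json"
    en = json.loads(en_path.read_text(encoding="utf-8"))
    he = json.loads(he_path.read_text(encoding="utf-8"))
    # Walk chapters and paragraphs, keeping English and Hebrew aligned.
    for ch, (en_ch, he_ch) in enumerate(zip(en["text"], he["text"]), start=1):
        for par, (en_par, he_par) in enumerate(zip(en_ch, he_ch), start=1):
            records.append({"book": en["title"], "chapter": ch, "paragraph": par,
                            "text_en": en_par, "text_he": he_par})

df = pd.DataFrame(records)
```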

Taking a look at the data, for example with df.head(), confirms that each row pairs an English paragraph with the original Hebrew, along with its book, chapter, and paragraph reference.

Looks good, we can move on to the vector database stage.

Next, we vectorize the text and store it in a local ChromaDB. In one sentence, the idea is to represent text as dense vectors (arrays of numbers) such that texts that are semantically similar will be close to each other in vector space. This is the technology that will enable us to retrieve the relevant passages given a query.

We opted for a lightweight vectorization model, the all-MiniLM-L6-v2, which can run efficiently on a CPU. This model provides a good balance between performance and resource efficiency, making it suitable for our application. While state-of-the-art models like OpenAI's text-embedding-3-large may offer superior performance, they require substantial computational resources, typically running on GPUs.

For more information about embedding models and their performance, you can refer to the MTEB leaderboard which compares various text embedding models on multiple tasks.

Here's the code we will use for vectorizing (it should only take a few minutes to run on this dataset on a CPU machine):
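The original snippet is not reproduced here; the following sketch uses LangChain's community integrations for Hugging Face embeddings and Chroma, and assumes the df built above (the metadata fields are my own choice).

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# all-MiniLM-L6-v2 is small enough to embed the corpus on a CPU.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Embed the English text; keep the source reference and the Hebrew original
# as metadata so retrieved passages can be traced back (and reused later for
# the cross-lingual version).
vectordb = Chroma.from_texts(
    texts=df["text_en"].tolist(),
    embedding=embeddings,
    metadatas=df[["book", "chapter", "paragraph", "text_he"]].to_dict("records"),
    persist_directory="chroma_db",
)
```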

With our dataset ready, we can now create our Retrieval-Augmented Generation (RAG) application in English. For this, we'll use LangChain, a powerful framework that provides a unified interface for various language model operations and integrations, making it easy to build sophisticated applications.

LangChain simplifies the process of integrating different components like language models (LLMs), retrievers, and vector stores. By using LangChain, we can focus on the high-level logic of our application without worrying about the underlying complexities of each component.

Here's the code to set up our RAG system:
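A minimal sketch of the chain, assuming Llama 3 70B Instruct on Bedrock as the generator (the Bedrock model id and the retriever settings are assumptions):

```python
from langchain.chains import RetrievalQA
from langchain_aws import ChatBedrock

# Llama 3 70B Instruct served through AWS Bedrock.
llm = ChatBedrock(model_id="meta.llama3-70b-instruct-v1:0", region_name="us-east-1")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,  # so we can show which passages were used
)
```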

Alright! Let's try it out! We will use a query related to the very first paragraphs in the Mishnah.

That seems pretty accurate.

Let's try a more sophisticated question:

Very nice.

For comparison, I also tried asking Claude the same question directly, without our RAG pipeline. Here's what I got:

The response is long and not to the point, and the answer that is given is incorrect (reaping is the third type of work in the list, while selecting is the seventh). This is what we call a hallucination.

While Claude is a powerful language model, relying solely on an LLM for generating responses from memorized training data, or even using internet searches, lacks the precision and control offered by a custom database in a Retrieval-Augmented Generation (RAG) application.

This structured retrieval process ensures users receive the most accurate and relevant answers, leveraging both the language generation capabilities of LLMs and the precision of custom data retrieval.

Finally, we will address the challenge of interacting in Hebrew with the original Hebrew text. The same approach can be applied to any other language, as long as you are able to translate the texts to English for the retrieval stage.

Supporting Hebrew interactions adds an extra layer of complexity since embedding models and large language models (LLMs) tend to be stronger in English. While some embedding models and LLMs do support Hebrew, they are often less robust than their English counterparts, especially the smaller embedding models that likely focused more on English during training.

To tackle this, we could train our own Hebrew embedding model. However, another practical approach is to leverage a one-time translation of the text to English and use English embeddings for the retrieval process. This way, we benefit from the strong performance of English models while still supporting Hebrew interactions.

In our case, we already have professional human translations of the Mishnah text into English. We will use this to ensure accurate retrievals while maintaining the integrity of the Hebrew responses. Here's how we can set up this cross-lingual RAG system:

For generation, we use Claude Sonnet since it performs significantly better on Hebrew text compared to Llama 3.

Here is the code implementation:
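A sketch of the cross-lingual flow, assuming the vectordb built above with the Hebrew originals stored in metadata; the Bedrock model ids and helper name are my own, and the newline handling reflects the Llama 3 quirks described below.

```python
from langchain_aws import ChatBedrock

translator = ChatBedrock(model_id="meta.llama3-70b-instruct-v1:0", region_name="us-east-1")
claude = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", region_name="us-east-1")

def ask_in_hebrew(question_he: str) -> str:
    # 1) One-time translation of the question into English for retrieval.
    #    Don't start the prompt with a newline, and cut the reply at the
    #    first newline to drop Llama 3's trailing chatter (see below).
    prompt = f"Translate the following question from Hebrew to English: {question_he}"
    question_en = translator.invoke(prompt).content.strip().split("\n")[0]

    # 2) Retrieve over the English embeddings.
    docs = vectordb.similarity_search(question_en, k=3)

    # 3) Ground Claude in the original Hebrew sources and answer in Hebrew.
    context = "\n\n".join(d.metadata["text_he"] for d in docs)
    answer_prompt = ("Answer the question based only on the sources below, "
                     f"and respond in Hebrew.\n\nSources:\n{context}\n\n"
                     f"Question: {question_he}")
    return claude.invoke(answer_prompt).content
```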

Let's try it! We will use the same question as before, but in Hebrew this time:

We got an accurate, one-word answer to our question. Pretty neat, right?

The translation with Llama 3 Instruct posed several challenges. Initially, the model produced nonsensical results no matter what I tried. (Apparently, Llama 3 Instruct is very sensitive to prompts starting with a newline character!)

After resolving that issue, the model tended to output the correct response but then continue with additional irrelevant text, so stopping the output at a newline character proved effective.

Controlling the output format can be tricky. Some strategies include requesting a JSON format or providing examples with few-shot prompts.

In this project, we also remove vowels from the Hebrew texts since most Hebrew text online does not include vowels, and we want the context for our LLM to be similar to text seen during pretraining.

Building this RAG application has been a fascinating journey, blending the nuances of ancient texts with modern AI technologies. My passion for making the library of ancient rabbinic texts more accessible to everyone (myself included) has driven this project. This technology enables chatting with your library, searching for sources based on ideas, and much more. The approach used here can be applied to other treasured collections of texts, opening up new possibilities for accessing and exploring historical and cultural knowledge.

It's amazing to see how all this can be accomplished in just a few hours, thanks to the powerful tools and frameworks available today. Feel free to check out the full code on GitHub, and play with the MishnahBot website.

Please share your comments and questions, especially if you're trying out something similar. If you want to see more content like this in the future, do let me know!


Principal Foundation and EVERFI from Blackbaud Reach 26,000 U.S. Students with Growing National Data Science … – PR Newswire

DataSetGo, a first-of-its-kind digital curriculum, is opening doors to data science careers at more than 400 schools, with $50,000 in recent awards to those who show promise in the field

DES MOINES, Iowa, June 4, 2024 /PRNewswire/ -- Principal Foundation, a global nonprofit organization committed to helping people and communities build financially secure futures, and EVERFI from Blackbaud, the leader in powering social impact through education, announce the second and biggest year of DataSetGo, a first-of-its-kind interactive digital curriculum that teaches high school students the fundamentals of data science and its value in daily life, the workforce, and the world.

Since its inception in 2022, DataSetGo has reached over 26,000 high school students in over 400 schools throughout the U.S. In the 2023-2024 academic year, the program reached over 17,000 new students and 200 additional schools across ten states, including New York, Texas, and California.

Last fall, DataSetGo expanded to include DataSetGo Distinguished Scholars, a new national program that equips students to explore postsecondary education and workforce opportunities, including those in the rapidly growing field of data science.

Data science roles can be found in nearly every industry and according to the World Economic Forum, there could be up to 1.4 million new jobs created in data science and data analytics between 2023 and 2027.

"Having learned more about the possible job opportunities has opened several possibilities I had no idea existed. It would be a dream to learn more about data analysis and science to eventually make a profession out of it one day," wrote Jarod Story, a Distinguished Scholar who attends Irving High School near Dallas, Texas.

All the schools that use the DataSetGo curriculum are in low- to moderate-income communities, where educators have given the program high marks. The research-backed curriculum was designed to align with national educational standards and is provided at no cost to educators through a strategic partnership between Principal Foundation and EVERFI.

"This program [DataSetGo] is totally awesome. I'm so overwhelmingly proud of my students," said LaTara Meyers, a teacher at H.D. Woodson High School in Washington, D.C., whose student Amaya Bostic is among the Distinguished Scholars.

The full list of ten Distinguished Scholars was announced in May. Each student received a $5,000 award, for a total of $50,000.

"These impressive students seized the opportunity to learn about data science and the doors it could open for them," said Jo Christine Miles, Director, Principal Foundation and Community Relations, Principal. "We're thrilled to provide awards that will help them continue to pursue their dreams."

"In nearly every industry, data science skills are in high demand. DataSetGo ensures that students are aware of the opportunities and equipped to pursue them because then, their career options are endless," said Ray Martinez, co-founder and President of EVERFIfrom Blackbaud.

Six of the ten Distinguished Scholars ("national award winners") were selected from a national pool of essay submissions that detailed how students plan to apply what they learned through DataSetGo in their careers and lives.

The other four Scholars ("local award winners") were selected from schools in Brooklyn, N.Y.; Minneapolis, Minn.; Washington, D.C.; and Dallas, Texas, that participated in DataSetGo virtual or in-person learning sessions. Three of these local award winners were selected throughout the school year.

The final local 2023-2024 Distinguished Scholar was announced at an in-person event in Brooklyn, New York on Tuesday, May 13. Hosted by EVERFI and Principal Foundation, the event celebrated the DataSetGo program and featured sessions with guest speakers who rely on data science in their careers in professional sports, artificial intelligence, and entertainment.

Below are the 2023-2024 DataSetGo Distinguished Scholars. The entry window for the 2024-2025 competition will open September 15.

National award winners:

Local award winners:

For more information about DataSetGo or the DataSetGo Distinguished Scholars award program, visit https://principal.everfi.com.

About Principal Foundation Principal Financial Group Foundation, Inc. ("Principal Foundation") is a duly recognized 501(c)(3) entity focused on providing philanthropic support to programs that build financial security in the communities where Principal Financial Group, Inc. ("Principal") operates. While Principal Foundation receives funding from Principal, Principal Foundation is a distinct, independent, charitable entity. Principal Foundation does not practice any form of investment advisory services and is not authorized to do so. Established in 1987, Principal Foundation works with organizations that are helping to shape and support the journey to financial security by ensuring access to essential needs, fostering social and cultural connections, and promoting financial inclusion. 3609043-052024

About EVERFI from Blackbaud EVERFI from Blackbaud (NASDAQ: BLKB) is an international technology company driving social impact through education to address the most challenging issues affecting society ranging from financial wellness to mental health to workplace conduct and other critical topics. Founded in 2008, EVERFI's Impact-as-a-Service solution and digital educational content have reached more than 45 million learners globally. In 2020, the company was recognized as one of the World's Most Innovative Companies by Fast Company and was featured on Fortune Magazine's Impact 20 List. The company was also named to the 2021 GSV EdTech 150, a list of the most transformative growth companies in digital learning. Blackbaud acquired EVERFI in December 2021. To learn more about EVERFI, please visit everfi.com or follow us on Facebook, Instagram, LinkedIn, or Twitter @EVERFI.

Blackbaud Forward-looking Statements Except for historical information, all the statements, expectations, and assumptions contained in this news release are forward-looking statements that involve a number of risks and uncertainties, including statements regarding expected benefits of products and product features. Although Blackbaud attempts to be accurate in making these forward-looking statements, it is possible that future circumstances might differ from the assumptions on which such statements are based. In addition, other important factors that could cause results to differ materially include the following: general economic risks; uncertainty regarding increased business and renewals from existing customers; continued success in sales growth; management of integration of acquired companies and other risks associated with acquisitions; risks associated with successful implementation of multiple integrated software products; the ability to attract and retain key personnel; risks associated with management of growth; lengthy sales and implementation cycles, particularly in larger organizations; technological changes that make our products and services less competitive; and the other risk factors set forth from time to time in the SEC filings for Blackbaud, copies of which are available free of charge at the SEC's website at http://www.sec.gov or upon request from Blackbaud's investor relations department. All Blackbaud product names appearing herein are trademarks or registered trademarks of Blackbaud, Inc.

Media Contact: Zevenia Dennis, [emailprotected]

SOURCE Principal Foundation


The One Billion Row Challenge in Julia | by Vikas Negi | Jun, 2024 – Towards Data Science

A recent release of Julia such as 1.10 is recommended. For those wanting to use a notebook, the repository shared above also contains a Pluto file, for which Pluto.jl needs to be installed. The input data file for the challenge is unique for everyone and needs to be generated using this Python script. Keep in mind that the file is about 15 GB in size.

Additionally, we will be running benchmarks using the BenchmarkTools.jl package. Note that this does not impact the challenge; it's only meant to collect proper statistics to measure and quantify the performance of the Julia code.

The structure of the input data file measurements.txt is as follows (only the first five lines are shown):
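A representative sample of the format (station name, then the ; separator, then the temperature); your generated file will contain different stations and values.

```text
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
```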

The file contains a billion lines (also known as rows or records). Each line has a station name followed by the ; separator and then the recorded temperature. The number of unique stations can be up to 10,000. This implies that the same station appears on multiple lines. We therefore need to collect all the temperatures for all distinct stations in the file, and then calculate the required statistics. Easy, right?

My first attempt was to simply parse the file one line at a time, and then collect the results in a dictionary where every station name is a key and the temperatures are added to a vector of Float64 to be used as the value mapped to the key. I expected this to be slow, but our aim here is to get a number for the baseline performance.
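A minimal sketch of that baseline (not the author's exact code): read one line at a time and push each temperature into a per-station vector.

```julia
# Baseline: accumulate every temperature for every station in a Dict.
function get_stations_dict(fname::String)
    stations = Dict{String, Vector{Float64}}()
    for line in eachline(fname)
        name, temp = split(line, ';')
        push!(get!(stations, String(name), Float64[]), parse(Float64, temp))
    end
    return stations
end
```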

Once the dictionary is ready, we can calculate the necessary statistics:
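For example, reducing each station's vector to a (min, mean, max) tuple:

```julia
using Statistics

# Reduce each station's temperature vector to (min, mean, max).
get_stats(stations) = Dict(name => (minimum(t), mean(t), maximum(t))
                           for (name, t) in stations)
```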

The output of all the data processing needs to be displayed in a certain format. This is achieved by the following function:
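A sketch of that formatting function, assuming the challenge's usual output convention of {name=min/mean/max, ...}, sorted by station name with one decimal place:

```julia
using Printf

function print_output(stats)
    entries = [@sprintf("%s=%.1f/%.1f/%.1f", name, s[1], s[2], s[3])
               for (name, s) in sort!(collect(stats); by = first)]
    println("{", join(entries, ", "), "}")
end
```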

Since this implementation is expected to take long, we can run a simple test by timing it with @time, running the following only once:
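Putting the pieces above together, a one-off timed run looks like this:

```julia
# Time a single end-to-end run of the naive implementation.
@time begin
    stations = get_stations_dict("measurements.txt")
    print_output(get_stats(stations))
end
```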

Our poor man's implementation takes about 526 seconds, so roughly 9 minutes. It's definitely slow, but not that bad at all!

Instead of reading the input file one line at a time, we can try to split it into chunks, and then process all the chunks in parallel. Julia makes it quite easy to implement a parallel for loop. However, we need to take some precautions while doing so.

Before we get to the loop, we first need to figure out how to split the file into chunks. This can be achieved using memory mapping to read the file. Then we need to determine the start and end positions of each chunk. It's important to note that each line in the input data file ends with a newline character, which has 0x0a as the byte representation. So each chunk should end at that character to ensure that we don't make any errors while parsing the file.

The following function takes the number of chunks num_chunks as an input argument, then returns an array with each element as a memory-mapped chunk.
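A sketch of that chunking function, following the newline-alignment logic described above:

```julia
using Mmap

# Memory-map the file and cut it into roughly equal chunks, each ending on a
# newline byte (0x0a) so no record is split across chunks.
function get_chunks(fname::String, num_chunks::Int)
    data = Mmap.mmap(fname)                # Vector{UInt8} backed by the file
    n = length(data)
    target = cld(n, num_chunks)            # approximate chunk size in bytes
    chunks = Vector{AbstractVector{UInt8}}()
    start = 1
    while start <= n
        stop = min(start + target - 1, n)
        while stop < n && data[stop] != 0x0a   # extend to the next newline
            stop += 1
        end
        push!(chunks, view(data, start:stop))
        start = stop + 1
    end
    return chunks
end
```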

Since we are parsing station and temperature data from different chunks, we also need to combine them in the end. Each chunk will first be processed into a dictionary as shown before. Then, we combine all chunks as follows:
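A sketch of the merge, concatenating the per-station vectors from every chunk:

```julia
# Merge per-chunk dictionaries by concatenating the temperature vectors.
function combine_chunks(dicts)
    combined = Dict{String, Vector{Float64}}()
    for d in dicts, (name, temps) in d
        append!(get!(combined, name, Float64[]), temps)
    end
    return combined
end
```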

Now we know how to split the file into chunks, and how we can combine the parsed dictionaries from the chunks at the end. However, the desired speedup can only be obtained if we are also able to process the chunks in parallel. This can be done in a for loop. Note that Julia should be started with multiple threads (julia -t 12) for this solution to have any impact.
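A sketch of the threaded loop; parse_chunk reuses the naive per-line logic, just over an in-memory chunk:

```julia
using Base.Threads

# Parse one chunk with the same per-line logic as the naive version.
function parse_chunk(chunk)
    d = Dict{String, Vector{Float64}}()
    for line in eachline(IOBuffer(chunk))
        name, temp = split(line, ';')
        push!(get!(d, String(name), Float64[]), parse(Float64, temp))
    end
    return d
end

# Process all chunks in parallel, then merge the partial results.
function process_parallel(fname::String, num_chunks::Int)
    chunks = get_chunks(fname, num_chunks)
    results = Vector{Dict{String, Vector{Float64}}}(undef, length(chunks))
    @threads for i in eachindex(chunks)
        results[i] = parse_chunk(chunks[i])
    end
    return combine_chunks(results)
end
```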

Additionally, we now want to run a proper statistical benchmark. This means that the challenge should be executed a certain number of times, and we should then be able to visualize the distribution of the results. Thankfully, all of this can be easily done with BenchmarkTools.jl. We cap the maximum number of samples at 10 and the maximum time for the total run at 20 minutes, and enable garbage collection (to free up memory) between samples. All of this can be brought together in a single script. Note that the input arguments are now the name of the file fname and the number of chunks num_chunks.
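A sketch with those benchmark settings; the chunk count of 48 is my assumption (any multiple of the thread count is reasonable):

```julia
using BenchmarkTools

fname, num_chunks = "measurements.txt", 48

# Max 10 samples, at most 20 minutes in total, GC between samples.
b = @benchmark begin
        stations = process_parallel($fname, $num_chunks)
        print_output(get_stats(stations))
    end samples=10 seconds=1200 gcsample=true
display(b)
```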

Benchmark results along with the inputs used are shown below. Note that we have used 12 threads here.

Multi-threading provides a big performance boost; we are now down to a little over 2 minutes. Let's see what else we can improve.

Until now, our approach has been to store all the temperatures, and then determine the required statistics (min, mean and max) at the very end. However, the same can already be achieved while we parse every line from the input file. We replace existing values each time a new value, which is either larger (for maximum) or smaller (for minimum), is found. For the mean, we sum all the values and keep a separate counter of how many times a temperature for a given station has been found.

Overall, our new logic looks like the following:
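A sketch of the updated per-chunk parser, tracking a running (min, max, sum, count) tuple per station instead of storing every value:

```julia
function parse_chunk_stats(chunk)
    d = Dict{String, NTuple{4, Float64}}()
    for line in eachline(IOBuffer(chunk))
        name, temp = split(line, ';')
        t = parse(Float64, temp)
        mn, mx, sm, cnt = get(d, name, (Inf, -Inf, 0.0, 0.0))
        d[String(name)] = (min(mn, t), max(mx, t), sm + t, cnt + 1.0)
    end
    return d
end
```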

The function to combine all the results (from different chunks) also needs to be updated accordingly.
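For example, merging the running tuples elementwise, with the mean computed only at the end as sum / count:

```julia
function combine_chunk_stats(dicts)
    combined = Dict{String, NTuple{4, Float64}}()
    for d in dicts, (name, s) in d
        c = get(combined, name, (Inf, -Inf, 0.0, 0.0))
        combined[name] = (min(c[1], s[1]), max(c[2], s[2]),
                          c[3] + s[3], c[4] + s[4])
    end
    return combined
end

# Convert the running tuples to (min, mean, max) for print_output.
finalize_stats(stats) = Dict(name => (s[1], s[3] / s[4], s[2])
                             for (name, s) in stats)
```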

Let's run a new benchmark and see if this change improves the timing.

The median time seems to have improved, but only slightly. It's a win, nonetheless!

Our previous logic to calculate and save the min and max for temperature can be further simplified. Moreover, following the suggestion from this Julia Discourse post, we can make use of views (using @view) when parsing the station names and temperature data. This has also been discussed in the Julia performance manual. Since we are using a slice expression for parsing every line, @view helps us avoid the cost of allocation and copying.
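A small illustration of the idea (not the full parser): slicing a line with @view yields a SubString instead of a freshly allocated copy, and parse accepts it directly.

```julia
line = "Hamburg;12.0"
i = findfirst(';', line)
name = @view line[1:i-1]                      # SubString, no copy
temp = parse(Float64, @view(line[i+1:end]))   # parse works on a SubString
```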

The rest of the logic remains the same. Running the benchmark now gives the following:

Whoa! We managed to get down to almost a minute. It seems switching to a view does make a big difference. Perhaps there are further tweaks that could be made to improve performance even further. In case you have any suggestions, do let me know in the comments.

Restricting ourselves only to base Julia was fun. However, in the real world, we will almost always be using packages, and thus existing efficient implementations, for performing the relevant tasks. In our case, CSV.jl (parsing the file in parallel) and DataFrames.jl (performing groupby and combine) will come in handy.

The function below performs the following tasks:
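The original function is not reproduced here; a sketch under the assumption that the file is read in parallel with CSV.jl and reduced with a DataFrames.jl groupby/combine:

```julia
using CSV, DataFrames, Statistics

function process_csv_df(fname::String; ntasks::Int = 12)
    # Read the semicolon-separated file using multiple tasks in parallel.
    df = CSV.read(fname, DataFrame;
                  header = ["station", "temp"], delim = ';', ntasks = ntasks)
    # Group by station and reduce to min/mean/max.
    stats = combine(groupby(df, :station),
                    :temp => minimum => :min,
                    :temp => mean => :mean,
                    :temp => maximum => :max)
    return sort!(stats, :station)
end
```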

We can now run the benchmark in the same manner as before.

The performance using CSV.jl and DataFrames.jl is quite good, albeit slower than our base Julia implementation. When working on real-world projects, these packages are an essential part of a data scientist's toolkit. It would thus be interesting to explore whether further optimizations are possible using this approach.


Hridesh Rajan named new dean of Tulane University School of Science and Engineering – Tulane University

June 03, 2024 12:00 PM


Hridesh Rajan, Kingland professor and chair of the Department of Computer Science at Iowa State University, has been named the new dean of Tulane University's School of Science and Engineering.

"A distinguished scholar and innovative leader, Hridesh brings an impressive breadth of knowledge and experience to this vital role. Bringing Hridesh to Tulane will elevate the School of Science and Engineering to new levels of excellence," President Michael A. Fitts and Provost Robin Forman wrote in a message to the Tulane community.

The message also noted that Rajan's selection followed an extensive national search that attracted an exceptionally strong pool of candidates.

"Joining Tulane SSE represents a unique opportunity for me to contribute to an institution that aligns with my values and to lead a school poised to make significant contributions to solving the pressing challenges of our time."

Hridesh Rajan, Dean of the School of Science and Engineering

At Iowa State University, Rajan led the development of cutting-edge new degree programs in artificial intelligence and computer science while implementing a cross-campus transdisciplinary research initiative of faculty and students interested in the foundations and applications of data science. He launched numerous other efforts that facilitated interdisciplinary faculty collaboration, guided the successful reaccreditation of ISU's computer science bachelor's program and greatly increased seed grants for graduate research.

Rajan developed new instructional methods that boosted the success rates of students and helped usher in a period of remarkable growth in enrollment, including a 45 percent increase in female students, as well as increases in faculty, staff and research funding.

Rajan, who will join Tulane July 1, cited the School of Science and Engineering's interdisciplinary strengths in areas vital to the future of humanity (health, energy, climate science and AI) as major draws to the new position.

"Joining Tulane SSE represents a unique opportunity for me to contribute to an institution that aligns with my values and to lead a school poised to make significant contributions to solving the pressing challenges of our time through transdisciplinary research, education and community outreach," he said.

Rajan earned both a PhD and an MS in computer science from the University of Virginia, and a Bachelor of Technology degree in computer science and engineering from the Indian Institute of Technology. He arrived at ISU in 2005 and served three years as the founding professor-in-charge of data science programs.

A Fulbright scholar, ACM Distinguished Scientist and fellow of the American Association for the Advancement of Science, Rajan said he recognizes Tulane's unique positioning at the center of health, energy, climate research, data science, artificial intelligence and other fields.

"Working closely with Tulane administration, the SSE Board of Advisors, the SSE executive committee, and our dedicated faculty, staff and students, our collective efforts will focus on enhancing interdisciplinary research, fostering innovation, and growing a strong, inclusive community that supports academic excellence and groundbreaking discoveries," he said.

Throughout his career Rajan has displayed a deep commitment to increased access for students from all backgrounds. At ISU he helped increase annual philanthropic commitments by an astounding 643 percent and worked continually to promote more inclusivity, greater representation and higher success rates for all students. His strategic vision led to the creation of an inclusive departmental plan extending through 2032.

An accomplished and award-winning researcher with more than 125 publications, Rajan focuses his research on data science, software engineering and programming languages, where he is best known for his design of the Boa programming language and infrastructure, which democratizes access to large-scale data-driven science and engineering.

Rajan will join Tulane as Kimberly Foster, who led the School of Science and Engineering through six successful and transformative years, steps down.


Understanding You Only Cache Once | by Matthew Gunton | Jun, 2024 – Towards Data Science

To understand the changes made here, we first need to discuss the Key-Value Cache. Inside of the transformer we have 3 vectors that are critical for attention to work: key, value, and query. From a high level, attention is how we pass along critical information about the previous tokens to the current token so that it can predict the next token. In the example of self-attention with one head, we multiply the query vector of the current token with the key vectors from the previous tokens and then normalize the resulting matrix (we call the resulting matrix the attention pattern). We now multiply the value vectors with the attention pattern to get the updates to each token. This data is then added to the current token's embedding so that it now has the context to determine what comes next.

We create the attention pattern for every single new token we create, so while the queries tend to change, the keys and the values are constant. Consequently, the current architectures try to reduce compute time by caching the key and value vectors as they are generated by each successive round of attention. This cache is called the Key-Value Cache.

While architectures like encoder-only and encoder-decoder transformer models have had success, the authors posit that the autoregression shown above, and the speed it allows its models, is the reason why decoder-only models are the most commonly used today.

To understand the YOCO architecture, we have to start out by understanding how it sets out its layers.

For one half of the model, we use one type of attention to generate the vectors needed to fill the KV Cache. Once it crosses into the second half, it will use the KV Cache exclusively for its key and value vectors, now generating the output token embeddings.

This new architecture requires two types of attention: efficient self-attention and cross-attention. We'll go into each below.

Efficient Self-Attention (ESA) is designed to achieve a constant inference memory. Put differently, we want the cache complexity to depend not on the input length but on the number of layers in our block. In the equation below, the authors abstracted ESA, but the remainder of the self-decoder is consistent, as shown below.
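The equation image from the original post is not reproduced here; reconstructed from the description that follows (and consistent with the YOCO paper's notation), the self-decoder layer is:

$$Y^l = \mathrm{ESA}\big(\mathrm{LN}(X^l)\big) + X^l, \qquad X^{l+1} = \mathrm{SwiGLU}\big(\mathrm{LN}(Y^l)\big) + Y^l$$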

Let's go through the equation step by step. X^l is our token embedding and Y^l is an intermediary variable used to generate the next token embedding X^(l+1). In the equation, ESA is Efficient Self-Attention, LN is the layer normalization function, which here was always Root Mean Square Norm (RMSNorm), and finally SwiGLU. SwiGLU is defined by the below:
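Reconstructed from the description that follows, the definition reads:

$$\mathrm{SwiGLU}(X) = \big(\mathrm{swish}(X W_G) \otimes (X W_1)\big)\, W_2$$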

Here swish(x) = x * sigmoid(Wg * x), where Wg is a trainable parameter. We then find the element-wise product (Hadamard product) between that result and X*W1, before multiplying that whole product by W2. The goal with SwiGLU is to get an activation function that will conditionally pass different amounts of information through the layer to the next token.

Now that we see how the self-decoder works, let's go into the two ways the authors considered implementing ESA.

First, they considered what is called Gated Retention. Retention and self-attention are admittedly very similar, with the authors of the "Retentive Network: A Successor to Transformer for Large Language Models" paper saying that the key difference lies in the activation function: retention removes softmax, allowing for a recurrent formulation. They use this recurrent formulation, along with its parallelizability, to drive memory efficiencies.

To dive into the mathematical details:

We have our typical matrices Q, K, and V, each of which is multiplied by the learnable weights associated with it. We then find the Hadamard product between the weighted matrices and the scalar γ. The goal in using γ is to create exponential decay, while we then use the D matrix to help with causal masking (stopping future tokens from interacting with current tokens) and activation.
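The equation itself did not survive the formatting; the parallel form from the Retentive Network paper, which the description above matches, is:

$$\mathrm{Retention}(X) = \big(Q K^{\top} \odot D\big) V, \qquad D_{nm} = \begin{cases} \gamma^{\,n-m}, & n \ge m \\ 0, & n < m \end{cases}$$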

Gated Retention is distinct from retention via the value of γ. Here a learned matrix W_γ is used to compute γ from the data, which allows our ESA to be data-driven.

Sliding Window ESA introduces the idea of limiting how many tokens the attention window should pay attention to. While in regular self-attention all previous tokens are attended to in some way (even if their value is 0), in sliding window ESA, we choose some constant value C that limits the size of these matrices. This means that during inference time the KV cache can be reduced to a constant complexity.

To again dive into the math:

We have our matrices being scaled by their corresponding weights. Next, we compute the head similarly to how multi-head attention is computed, where B acts both as a causal mask and as a way to make sure only the last C tokens are attended to.


Neo4j Announces Collaboration with Snowflake for Advanced AI Insights & Predictive Analytics USA – English – PR Newswire

Neo4j knowledge graphs, graph algorithms, and ML tools are fully integrated within Snowflake - with zero ETL & requiring no specialist graph expertise

SAN FRANCISCO, June 4, 2024 /PRNewswire/ -- Graph database and analytics leader Neo4j today announced at Snowflake's annual user conference, Snowflake Data Cloud Summit 2024, a partnership with Snowflake to bring its fully integrated native graph data science solution within Snowflake AI Data Cloud. The integration enables users to instantly execute more than 65 graph algorithms, eliminates the need to move data out of their Snowflake environment, and empowers them to leverage advanced graph capabilities using the SQL programming languages, environment, and tooling that they already know.

The offering removes complexity, management hurdles, and learning curves for customers seeking graph-enabled insights crucial for AI/ML, predictive analytics, and GenAI applications. The solution features the industry's most extensive library of graph algorithms to identify anomalies and detect fraud, optimize supply chain routes, unify data records, improve customer service, power recommendation engines, and hundreds of other use cases. Anyone who uses Snowflake SQL can get more projects into production faster, accelerate time-to-value, and generate more accurate business insights for better decision-making.

Neo4j graph data science is an analytics and machine learning (ML) solution that identifies and analyzes hidden relationships across billions of data points to improve predictions and discover new insights. Neo4j's library of graph algorithms and ML modeling enables customers to answer questions like what's important, what's unusual, and what's next. Customers can also build knowledge graphs, which capture relationships between entities, ground LLMs in facts, and enable LLMs to reason, infer, and retrieve relevant information more accurately and effectively. Neo4j graph data science customers include Boston Scientific, Novo Nordisk, OrbitMI, and Zenapse, among many others.

"By 2025, graph technologies will be used in 80% of data and analytics innovations up from 10% in 2021 facilitating rapid decision-making across the enterprise," predicts Gartner in its Emerging Tech Impact Radar: Data and Analytics November 20, 2023 report. Gartner also notes, "Data and analytics leaders must leverage the power of large language models (LLMs) with the robustness of knowledge graphs for fault-tolerant AI applications," in the November 2023 report AI Design Patterns for Knowledge Graphs and Generative AI.

Neo4j with Snowflake: new offering capabilities and benefits

Enterprises can harness and scale their secure, governed data natively in Snowflake and augment it with Neo4j's graph analytics and reasoning capabilities for more efficient and timely decision-making, saving customers time and resources.

Supporting quotes

Greg Steck, VP Consumer Analytics, Texas Capital Bank

"At Texas Capital Bank, we're built to help businesses and their leaders succeed. We use Snowflake and Neo4j for critical customer 360 and fraud use cases where relationships matter. We are excited about the potential of this new partnership. The ability to use Neo4j graph data science capabilities within Snowflake will accelerate our data applications and further enhance our ability to bring our customers long-term success."

Jeff Hollan, Head of Applications and Developer Platform, Snowflake

"Integrating Neo4j's proven graph data science capabilities with the Snowflake AI Data Cloud marks a monumental opportunity for our joint customers to optimize their operations. Together, we're equipping organizations with the tools to extract deeper insights, drive innovation at an unprecedented pace, and set a new standard for intelligent decision-making."

Sudhir Hasbe, Chief Product Officer, Neo4j

"Neo4j's leading graph analytics combined with Snowflake's unmatched scalability and performance redefines how customers extract insights from connected data while meeting users in the SQL interfaces where they are today. Our native Snowflake integration empowers users to effortlessly harness the full potential of AI/ML, predictive analytics, and Generative AI for unparalleled insights and decision-making agility."

The new capabilities are available for preview and early access, with general availability later this year on Snowflake Marketplace. For more information, read our blog post or contact us for a preview of Neo4j on Snowflake AI Data Cloud.

To learn more about how organizations are building next gen-apps on Snowflake, click here.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

About Neo4j

Neo4j, the Graph Database & Analytics leader, helps organizations find hidden patterns and relationships across billions of data connections deeply, easily, and quickly. Customers leverage the structure of their connected data to reveal new ways of solving their most pressing business problems, from fraud detection, customer 360, knowledge graphs, supply chain, personalization, IoT, network management, and more, even as their data grows. Neo4j's full graph stack delivers powerful native graph storage with native vector search capability, data science, advanced analytics, and visualization, with enterprise-grade security controls, scalable architecture, and ACID compliance. Neo4j's dynamic open-source community brings together over 250,000 developers, data scientists, and architects across hundreds of Fortune 500 companies, government agencies, and NGOs. Visit neo4j.com.

Contact: [emailprotected] neo4j.com/pr

© 2024 Neo4j, Inc., Neo Technology, Neo4j, Cypher, Neo4j Bloom, Neo4j Graph Data Science Library, Neo4j Aura, and Neo4j AuraDB are registered trademarks or a trademark of Neo4j, Inc. All other marks are owned by their respective companies.

SOURCE Neo4j


Effective Strategies for Managing ML Initiatives | by Anna Via | Jun, 2024 – Towards Data Science

Embracing uncertainty, the right people, and learning from the data

Picture by Cottonbro, on Pexels

This blog post is an updated version of part of a conference talk I gave at GOTO Amsterdam last year. The talk is also available to watch online.

Providing value and positive impact through machine learning product initiatives is not an easy job. One of the main reasons for this complexity is the fact that, in ML initiatives developed for digital products, two sources of uncertainty intersect. On one hand, there is the uncertainty related to the ML solution itself (will we be able to predict what we need to predict with good enough quality?). On the other hand, there is the uncertainty related to the impact the whole system will be able to provide (will users like this new functionality? will it really help solve the problem we are trying to solve?).

All this uncertainty means failure in ML product initiatives is something relatively frequent. Still, there are strategies to manage and improve the probabilities of success (or at least to survive through them with dignity!). Starting ML initiatives on the right foot is key. I discussed my top learnings in that area in a previous post: start with the problem (and define how predictions will be used from the beginning), start small (and maintain small if you can), and prioritize the right data (quality, volume, history).

However, starting a project is just the beginning. The challenge to successfully manage an ML initiative and provide a positive impact continues throughout the whole project lifecycle. In this post, I'll share my top three learnings on how to survive and thrive during ML initiatives:

It is really hard (impossible even!) to plan ML initiatives beforehand and to develop them according to that initial plan.

The most popular project plan for ML initiatives is the ML Lifecycle, which splits the phases of an ML project into business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Although these phases are drawn as consecutive steps, in many representations of this lifecycle you'll find arrows pointing backward: at any point in the project, you might learn something that forces you to go back to a previous phase.

This translates into projects where it is really hard to know when they will finish. For example, during the evaluation step, you might realize thanks to model explainability techniques that a specific feature wasn't well encoded, and this forces you to go back to the data preparation phase. It could also happen that the model isn't able to predict with the quality you need, and might force you to go back to the beginning in the business understanding phase to redefine the project and business logic.

Whatever your role in an ML initiative or project, it is key to acknowledge that things won't go according to plan, to embrace all this uncertainty from the beginning, and to use it to your advantage. This is important both for managing stakeholders (expectations, trust) and for yourself and the rest of the team (motivation, frustration).

Any project starts with people. The right combination of people, skills, perspectives, and a network that empowers you.

The days when Machine Learning (ML) models were confined to the Data Scientist's laptop are over. Today, the true potential of ML is realised when models are deployed and integrated into the company's processes. This means more people and skills need to collaborate to make that possible (Data Scientists, Machine Learning Engineers, Backend Developers, Data Engineers).

The first step is identifying the skills and roles required to successfully build the end-to-end ML solution. However, a group of roles covering a list of skills is not enough. Having a diverse team that brings different perspectives and can empathize with different user segments has proven to help teams improve their ways of working and build better solutions (why having a diverse team will make your products better).

People don't talk about this enough, but the key people for delivering a project go beyond the team itself. I refer to these other people as the network. The network is made up of people you know are really good at specific things, whom you trust enough to ask for help and advice when needed, and who can unblock, accelerate, or empower you and the team. The network can be your business stakeholders, manager, staff engineers, user researchers, data scientists from other teams, the customer support team... Make sure you build your own network and identify the ally you can go to for each specific situation or need.

A project is a continuous learning opportunity, and many times learnings and insights come from checking the right data and monitors.

In ML initiatives, there are three big groups of metrics and measures that can bring a lot of value in terms of learnings and insights: model performance monitoring, service performance, and final impact monitoring. I took a deep dive into this topic in a previous post.

Checking the right data and monitors while developing or deploying ML solutions is key to catching issues early, understanding how the solution behaves, and confirming it delivers the expected impact.

Effectively managing ML initiatives from beginning to end is a complex task with multiple dimensions. In this blog post I shared, based on my experience first as a Data Scientist and later as an ML Product Manager, the factors I consider key when dealing with an ML project: embracing uncertainty, surrounding yourself with the right people, and learning from the data.

I hope these insights help you successfully manage your ML initiatives and drive positive impact through them. Stay tuned for more posts about the intersection of Machine Learning and Product Management 🙂

Read the original here:

Effective Strategies for Managing ML Initiatives | by Anna Via | Jun, 2024 - Towards Data Science

Read More..

Linear Attention Is All You Need. Self-attention at a fraction of the | by Sam Maddrell-Mander | Jun, 2024 – Towards Data Science

This is the kind of thing anyone who's spent much time working with transformers and self-attention will have heard a hundred times. It's absolutely true; we've all experienced it as you try to increase the context size of your model and everything suddenly comes to a grinding halt. But at the same time, virtually every week it seems there's a new state-of-the-art model with a new record-breaking context length. (Gemini has a context length of 2M tokens!)

There are lots of sophisticated methods, like RingAttention, that make training on incredibly long context lengths in large distributed systems possible, but what I'm interested in today is a simpler question.

How far can we get with linear attention alone?

This will be a bit of a whistle-stop tour, but bear with me as we touch on a few key points before digging into the results.

We can basically summarise the traditional attention mechanism with two key points: the output for each token is a weighted average of the value vectors, and the weights come from a softmax over the query-key dot products.

This is expressed in the traditional form as:
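
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the key dimension; the $QK^\top$ term is what makes the cost quadratic in sequence length.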

It turns out, if we ask our mathematician friends, we can think about this slightly differently. The softmax can be thought of as one of many ways of describing the probability distribution relating tokens to each other. We can use any similarity measure we like (the dot product being one of the simplest), and as long as we normalise it, we're fine.

It's a little sloppy to say this is attention; in fact, it's only the attention we know and love when the similarity function is the exponential of the dot product of queries and keys (given below), as we find in the softmax. But this is where it gets interesting: instead of using this exact expression, what if we could approximate it?
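
$$\mathrm{sim}(q, k) = \exp\!\left(\frac{q^\top k}{\sqrt{d_k}}\right)$$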

We can assume there is some feature map $\phi$ which gives us a result nearly the same as taking the exponential of the dot product: $\phi(q)^\top \phi(k) \approx \mathrm{sim}(q, k)$. Crucially, writing the expression in this factored form allows us to play with the order of matrix multiplication operations.
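
Substituting into the normalised attention expression, the output for the $i$-th query row becomes:

$$V'_i = \frac{\sum_j \phi(Q_i)^\top \phi(K_j)\, V_j}{\sum_j \phi(Q_i)^\top \phi(K_j)}$$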

In the paper they propose the Exponential Linear Unit (ELU) as the basis of the feature map due to a number of useful properties.
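
Concretely, the feature map used is:

$$\phi(x) = \mathrm{elu}(x) + 1$$

which is cheap to evaluate and strictly positive, so both the similarity scores and the normaliser stay positive.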

We won't spend too much more time on this here, but this is pretty well empirically verified as a fair approximation to the softmax function.

What this allows us to do is change the order of operations. We can take the product of our feature map of K with V first to make a KV block, then take the product with Q. The quadratic term is now in the model dimension rather than the sequence length.

Putting this all together into the linear attention expression gives us:
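
$$V'_i = \frac{\phi(Q_i)^\top \left(\sum_j \phi(K_j)\, V_j^\top\right)}{\phi(Q_i)^\top \sum_j \phi(K_j)}$$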

where we only need to compute the terms in the brackets once, reusing them for every query row.

(If you want to dig into how the causal masking fits into this and how the gradients are calculated, take a look at the paper. Or watch this space for a future blog.)

The mathematical case is strong, but personally, until I've seen some benchmarks, I'm always a bit suspicious.

Let's start by looking at code snippets describing each of these terms. The softmax attention will look very familiar; we're not doing anything fancy here.
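
In PyTorch, a minimal sketch might look like this (the function name and the batched tensor shapes (batch, seq_len, d_k) are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    # Scores are (batch, seq_len, seq_len): this is the quadratic term
    scores = torch.einsum("bid,bjd->bij", q, k) / d_k**0.5
    weights = F.softmax(scores, dim=-1)
    # Weighted average of the values
    return torch.einsum("bij,bjd->bid", weights, v)
```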

Then for the linear attention, we start by getting the Query, Key and Value matrices, then apply the ELU feature mapping to the Queries and Keys. Then we use einsum notation to perform the multiplications.
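
A minimal sketch of that, under the same shape assumptions as above:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, d_k)
    # Feature map phi(x) = elu(x) + 1 keeps all similarities positive
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    # KV block: (batch, d_k, d_v); cost is linear in seq_len
    kv = torch.einsum("bjd,bje->bde", k, v)
    # Normaliser: phi(Q_i) dotted with the sum of phi(K_j) over j
    z = 1.0 / (torch.einsum("bid,bd->bi", q, k.sum(dim=1)) + eps)
    # Numerator phi(Q) @ KV, scaled row-wise by the normaliser
    return torch.einsum("bid,bde,bi->bie", q, kv, z)
```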

Seeing this written in code is all well and good, but what does it actually mean experimentally? How much of a performance boost are we talking about here? It can be hard to appreciate the degree of speed-up going from a quadratic to a linear bottleneck, so I've run the following experiment.

We're going to take a single attention layer, with a fixed model dimension of d_k = 64, and benchmark the time taken for a forward pass on a batch of 32 sequences. The only variable will be the sequence length, spanning 128 up to 6000 (for reference, the GPT-3 context length is 2048). Each run is repeated 100 times to get a mean and standard deviation, and experiments are run on an Nvidia T4 GPU.
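
A timing harness along these lines could look like the following (a sketch under the stated settings; the helper name and warm-up count are assumptions):

```python
import time
import torch

def benchmark(attn_fn, seq_len, batch=32, d_k=64, n_runs=100, device="cuda"):
    q, k, v = (torch.randn(batch, seq_len, d_k, device=device) for _ in range(3))
    for _ in range(10):  # warm-up so CUDA initialisation doesn't skew timings
        attn_fn(q, k, v)
    times = []
    for _ in range(n_runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        attn_fn(q, k, v)
        torch.cuda.synchronize()  # wait for the kernel before stopping the clock
        times.append(time.perf_counter() - start)
    t = torch.tensor(times)
    return t.mean().item(), t.std().item()
```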

For such a simple experiment the results are pretty striking.

The results show that even for an incredibly small toy example we get a speed-up of up to 60x.

There are a few obvious takeaways here:

For completeness, do not mistake this as saying linear attention makes models 60x faster across the board. In reality, the feed-forward layers often account for a bigger chunk of the parameters in a Transformer, and the encoding/decoding is often a limiting component as well. But in this tightly defined problem, it's pretty impressive!

Continue reading here:

Linear Attention Is All You Need. Self-attention at a fraction of the | by Sam Maddrell-Mander | Jun, 2024 - Towards Data Science

Read More..

10 Essential DevOps Tools Every Beginner Should Learn – KDnuggets

DevOps (Development Operations) and MLOps (Machine Learning Operations) are closely related and share a wide variety of tools. As a DevOps engineer, you will deploy, maintain, and monitor applications, whereas as an MLOps engineer, you will deploy, manage, and monitor machine learning models in production. So, it is beneficial to learn about DevOps tools, as doing so opens a wide array of job opportunities for you. DevOps refers to a set of practices and tools designed to increase a company's ability to deliver applications and services faster and more efficiently than traditional software development processes.

In this blog, you will learn about essential and popular tools for versioning, CI/CD, testing, automation, containerization, workflow orchestration, cloud, IT management, and monitoring applications in production.

Git is the backbone of modern software development. It is a distributed version control tool that allows multiple developers to work on the same codebase without interfering with each other. Understanding Git is fundamental if you are getting started with software development.

Learn about 14 Essential Git Commands for versioning and collaborating on data science projects.

GitHub Actions simplifies the automation of your software workflows, enabling you to build, test, and deploy your code directly from GitHub with just a few lines of code. As a core function of DevOps engineering, mastering Continuous Integration and Continuous Delivery (CI/CD) is crucial for success in the field. By learning to automate workflows, generate logs, and troubleshoot issues, you will significantly enhance your job prospects.

Remember, in operations-related careers it is all about experience and portfolio.

Learn how to automate machine learning training and evaluation by following GitHub Actions For Machine Learning Beginners.

Selenium is a powerful tool primarily used for automating web browser interactions, allowing you to efficiently test your web application. With just a few lines of code, you can harness the power of Selenium to control a web browser, simulate user interactions, and perform automated testing on your web application, ensuring its functionality, reliability, and performance.
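
For example, a minimal sketch with the Selenium Python bindings (the URL and assertion are placeholders; assumes a local Chrome driver is installed):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a local Chrome/chromedriver setup
try:
    driver.get("https://example.com")
    heading = driver.find_element(By.TAG_NAME, "h1")
    # A trivial assertion standing in for a real test case
    assert heading.text, "expected the page to render a heading"
finally:
    driver.quit()
```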

Since many servers use Linux, understanding this operating system can be crucial. Linux commands and scripts form the foundation of many operations in the DevOps world, from basic file manipulation to automating the entire workflow. In fact, many seasoned developers rely heavily on Linux scripting, particularly Bash, to develop custom solutions for data loading, manipulation, automation, logging, and numerous other tasks.

Learn about the most commonly used Linux commands by checking out the Linux for Data Science cheat sheet.

Familiarity with Cloud Platforms like AWS, Azure, or Google Cloud Platform is essential for landing a job in the industry. The majority of services and applications that we use every day are deployed on the Cloud.

Cloud platforms offer services that can help you deploy, manage, and scale applications. By gaining expertise in Cloud platforms, you'll be able to harness the power of scalability, flexibility, and cost-effectiveness, making you a highly sought-after professional in the job market.

Start with the Beginner's Guide to Cloud Computing and learn how cloud computing works, the top cloud platforms, and their applications.

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.
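
As a quick taste, here is a minimal sketch using the Docker SDK for Python (one option among several; the plain CLI would do just as well, and this assumes a running Docker daemon and `pip install docker`):

```python
import docker

client = docker.from_env()
# Pull and run a throwaway container, capturing its stdout
output = client.containers.run("hello-world", remove=True)
print(output.decode())
```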

Learn more about Docker by following the Docker Tutorial for Data Scientists.

Kubernetes is a powerful container orchestration tool that automates the deployment, scaling, and management of containers across diverse environments. As a DevOps engineer, mastering Kubernetes is essential to efficiently scale, distribute, and manage containerized applications, ensuring high availability, reliability, and performance.
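
For a flavour of the API surface, a minimal sketch with the official Kubernetes Python client (assumes a reachable cluster and a local kubeconfig, e.g. from minikube):

```python
from kubernetes import client, config

config.load_kube_config()  # reads the local kubeconfig
v1 = client.CoreV1Api()
# List every pod in the cluster with its namespace and current phase
for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```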

Read Kubernetes In Action: Second Edition book to learn about the essential tool for anyone deploying and managing cloud-native applications.

Prometheus is an open-source monitoring and alerting toolkit originally built at SoundCloud. It enables you to monitor a wide range of metrics and receive alerts in real time, providing unparalleled insights into your system's performance and health. By learning Prometheus, you will be able to identify issues quickly, optimize system efficiency, and ensure high uptime and availability.
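
A minimal sketch of what instrumenting a service looks like with the official Python client (the metric names, port, and simulated work are illustrative):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8000)  # exposes metrics at :8000/metrics for scraping
    while True:
        with LATENCY.time():            # observe how long the "work" takes
            time.sleep(random.random() / 10)
        REQUESTS.inc()
```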

Terraform, an open-source infrastructure as code (IaC) tool developed by HashiCorp, enables you to provision, manage, and version infrastructure resources across multiple cloud and on-premises environments with ease and precision. It supports a wide range of existing service providers, as well as custom in-house solutions, allowing you to create, modify, and track infrastructure changes safely, efficiently, and consistently.

Ansible is a simple, yet powerful, IT automation engine that streamlines provisioning, configuration management, application deployment, orchestration, and a multitude of other IT processes. By automating repetitive tasks, deploying applications, and managing configurations across diverse environments - including cloud, on-premises, and hybrid infrastructures - Ansible empowers users to increase efficiency, reduce errors, and improve overall IT agility.

Learning about these tools is just the starting point for your journey in the world of DevOps. Remember, DevOps is about more than just tools: it is about creating a culture that values collaboration, continuous improvement, and innovation. By mastering these tools, you will build a solid foundation for a successful career in DevOps. So, begin your journey today and take the first step towards a highly paid and exciting career.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

Read the original here:

10 Essential DevOps Tools Every Beginner Should Learn - KDnuggets

Read More..