Category Archives: Data Science

AI In Data Analytics: The 10 Best Tools – AiThority

Google, Intel, IBM, NVIDIA, Amazon, PwC: the list of big brands adopting AI in data analysis goes on.

The term artificial intelligence data analysis refers to the application of data science and AI methods to improve data cleansing, inspection, and modeling. The ultimate aim is to find useful data that can back up conclusions and decisions.

AI streamlines operations by automating repetitive tasks. Companies can save time and effort by training a computer program to do repetitive tasks instead of humans. Artificial intelligence (AI) can be programmed to mimic human intellect, which allows it to recognize patterns and produce reliable results.

While learning about this topic, it's crucial to understand that data analytics and data analysis are not the same thing. Data analytics, a branch of BI, is all about mining data for hidden patterns and trends using machine learning.


Here are some of the best AI tools to analyze data that are trending in 2024.

With PolymerSearch.com, an easy-to-use business intelligence (BI) tool, you can make professional-quality data visualizations, dashboards, and presentations, all without ever touching a piece of code. Polymer integrates easily with many types of data sources, such as Google Analytics, Facebook, Google Ads, Google Sheets, Airtable, Shopify, Jira, Stripe, WooCommerce, BigCommerce, and more. You may also upload datasets using XLS or CSV files. After you're linked, Polymer's AI will automatically evaluate your data, provide insightful suggestions, and create visually appealing dashboards.

With Tableau, customers can engage with their data without knowing how to code, thanks to its analytics and data visualization capabilities. The user-friendly platform facilitates the real-time creation, modification, and seamless sharing of dashboards and reports among users and teams. As one would expect from a tool of its kind, it supports databases of varying sizes and provides users with several visualization choices to help them make sense of their data.

Another tool that doesn't require coding is MonkeyLearn, which allows customers to see and reorganize their data with AI data analysis features. Depending on the user's requirements, the platform's built-in text analysis capabilities can quickly assess and display data. Automatic data sorting by topic or intent, feature extraction from products, and user data extraction are all within the user's control with text classifiers and text extractors.


One well-known business intelligence product, Microsoft Power BI, also lets users visualize and filter their data to find insights. Users can begin making reports and dashboards right away after importing data from almost any source. In addition to using AI-powered features to analyze data, users can construct machine learning models. Despite its higher price tag, the platform offers native Excel integration and a user interface that is quicker and more responsive than competing options. It also comes with many integrations.

Another data analytics tool that helps developers and analysts organize and display data is Sisense. The platform's dynamic user interface and many drag-and-drop capabilities make it simple to use. When working with huge datasets, Sisense's In-Chip technology makes calculation faster by letting users pick between RAM and CPU to handle the data. Users with basic reporting and visualization needs who are working with smaller datasets may find the platform to be a decent fit, despite its restricted visualization features.

Back when it was first released, Microsoft Excel stood head and shoulders above the competition when it came to data analysis. Quickly process and analyze data, make various basic visualizations, and filter data with search boxes and pivot tables, all with Excel's Data Analysis Toolpak. Machine learning models, cluster data calculations, and complicated neural networks can all be built in Excel using formulas, and the program even lets users avoid coding altogether. Even without the requirement to code, Excel's spreadsheet paradigm and steep learning curve limit its potential.

To help businesses make informed decisions, Akkio provides a platform for data analytics and forecasting. You can qualify, segment, and prioritize your lead lists with the help of this no-code platform's lead-scoring tools. Using the data at their disposal, users can generate forecasts on nearly any dataset thanks to the forecasting features. Quick and easy to use, the tool has a small but helpful set of connectors for transferring data to and from other programs.

Both technical and non-technical users will appreciate QlikView's adaptability and the many data exploration options it comes with. Teams can work together on the platform with ease, using workflows and drag-and-drop editors to customize their data. Despite its robust functionality, QlikView is only a good fit for users who can make full use of the platform, due to its high price and relatively limited AI feature set.

Looker is another no-code tool for data analysis and business intelligence, and it is part of Google Cloud. It has significant features and integrates with numerous services. Looker can consolidate all of a user's data sources into one location, handle massive databases, and let users create numerous dashboards and reports. In addition to having Google's support, the platform has powerful data modeling capabilities. The site is user-friendly; however, it lacks customization options and makes report creation a tedious process.

SAP BusinessObjects integrates well with the rest of the SAP suite and enables less technical users to analyze, visualize, and report on their data. It gives people access to AI and ML tools, which they can use for things like data visualization and modeling, better reporting, and dashboarding. Users can also get predictive forecasting features to go further into their data with this tool. Despite the platform's price cuts, the solution's overall cost, especially when purchasing platform licenses, can be too high for some. The tool is most suitable for users who are already SAP customers and can make use of an AI data tool that integrates with their existing SAP capabilities.


We had exclusive commentary from one of our AiThority guests, Arvind Rao, Chief Technology Officer, Edge Platforms, EdgeVerve, in his byline.

Companies are increasingly using Robotic Process Automation (RPA), easily among the most widely applied tools, to streamline all insurance processes, including marketing, renewals, and sales. A notable instance from the industry demonstrates that Connected Automation can significantly enhance operational efficacy, with one major insurance firm in the US reportedly achieving around 95% efficiency in its processes.

While admittedly RPA has its embedded advantages, it is also critical to leverage cognitive capabilities with AI and analytics for a greater degree of efficiency. The inclusion of cognitive software solutions, like natural language processing, can contribute to the transformation of the insurance business from a purely human-oriented domain to an intelligent business landscape.

Clearly, the technological options available at present can only address part of the challenge. Leaders of connected enterprises have the task of persuading insurance firms to move away from traditional methods and to further raise the level of intelligent technology adoption. Even when AI is used in the process, data of low relevance can have a debilitating impact on decision-making. Contextual data, incorporation of the organization's policies, and historical interpretation of policy decisions, together with AI, can help produce more intelligent and accurate recommendations to underwriters about what kind of risk is acceptable.

An estimated $154 billion was spent worldwide on AI research and implementation in 2023, marking the fastest-ever growth in AI expenditure.

Among artificial intelligence subfields, generative AI is booming. With the rise of chatbots and other forms of direct user interaction with AI, AI systems are rapidly becoming more collaborative.


According to reports, three billion individuals use Google's AI assistant for email assistance and collaboration within the Google Workspace suite. Separately, in just a few months, ChatGPT (developed by OpenAI, which is backed by Microsoft) amassed more than 100 million users. Another development in artificial intelligence is the displacement of huge corporations by smaller generative models that may be run on desktop computers. Companies no longer need to depend on a third party to develop their AI applications; new approaches in deep learning and neural networks greatly improve the efficiency of running AI models on local devices. This is in contrast to traditional AI models, which consume a lot of resources.

AI uses natural language processing (NLP) and other techniques to analyze unstructured data like text, images, and audio, extracting valuable insights.

Supervised learning involves training an AI model on a labeled dataset, whereas unsupervised learning involves finding patterns and relationships in data without labeled outcomes.

Yes, AI can process and analyze data in real time, enabling immediate insights and timely decision-making.

Neural networks are a type of machine learning model inspired by the human brain, used in tasks like image recognition, speech processing, and complex pattern recognition in data analytics.

AI can automatically generate insightful visualizations, highlight key trends and anomalies, and personalize dashboards based on user preferences and behaviors.

AI models can identify deviations from normal patterns in data, which is useful for detecting fraud, network security breaches, and other irregular activities.
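As a rough illustration of "deviation from normal patterns", the sketch below flags values whose z-score exceeds a threshold. The z-score rule, the function name find_anomalies, the threshold, and the sample figures are all assumptions chosen for illustration; production tools typically rely on more elaborate models.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]

# Hypothetical daily transaction counts with one obvious spike.
daily_transactions = [102, 98, 101, 97, 103, 99, 100, 500, 101, 98]
anomalies = find_anomalies(daily_transactions, threshold=2.0)
print(anomalies)
```

A single extreme value inflates the standard deviation itself, so a small sample like this one needs a modest threshold for the spike to be flagged.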

AI helps analyze customer data to understand behavior, predict future actions, personalize marketing, and improve customer satisfaction.

Ethical considerations include ensuring data privacy, avoiding biases in AI models, maintaining transparency in AI decisions, and preventing misuse of AI insights.

Businesses can start by identifying use cases, ensuring data quality, selecting appropriate AI tools, and hiring or training staff with the necessary skills.

Deep learning is a subset of machine learning that uses multi-layered neural networks to analyze large and complex datasets, enabling high-level abstraction and insights.


Pentagon pauses development of its go-to data analytics tool – Defense One

Updated: 6:10 p.m. ET.

The Pentagon is pausing development of Advana, its default data-analytics platform, so it can be upgraded to handle increased demand, according to an internal email obtained by Defense One.

In the June 3 email, the Pentagon's new chief data and artificial intelligence officer directed developers of the Advana platform to pause much of the active work and additional features until infrastructure changes are complete.

The pause will force users who were banking on forthcoming features, tools, or applications to use existing Advana tools to do their work. The email did not indicate when development might resume.

"The Department's demand for enterprise data and analytics services have outgrown the original architecture of the Advana platform," wrote Radha Iyengar Plumb, reflecting a review she undertook soon after taking office in April of the infrastructure, tech tools, and applications that her office relies on.

"I have directed the team to focus on upgrading the technical framework of Advana to better meet the requirements of our customers and the whole Department, and develop a sustainable enterprise solution for the future," Plumb wrote. "To accelerate these platform enhancements, I have also directed the team to reprioritize activities from continuing to build on Advana's current platform to focus on its future platform engineering activities. As part of this, we are looking hard at a variety of applications in the Advana ecosystem."

During the developmental pause, the Pentagon will evaluate whether the applications are stable and should be integrated on the upgraded infrastructure, or if they would fit better elsewhere in DOD's tech enterprise.

"I understand this will disrupt the planned uses and services for your teams, and I did not make this decision lightly. I understand from my team that this reverses prior commitments made and want to acknowledge the impacts this may have on your roadmaps and leadership obligations," Plumb wrote. "My team and I are dedicated to working with you all to identify alternative hosting environments for your use case or transition your use case to generally available tools on [Advana] until we are able to relook at bringing new vendor pilots onto the future infrastructure environment."

The Advana platform has been an important part of the Pentagon's data and analytics efforts, getting its start with financial data management and growing to include other areas across the defense enterprise. In a 2021 memo, Deputy Defense Secretary Kathleen Hicks billed Advana as the Pentagon's go-to data analytics platform and said defense organizations must get executive approval to use other platforms.

"The Advancing Analytics (Advana) platform is the single enterprise authoritative data management and analytics platform for the Secretary of Defense, Deputy Secretary of Defense, and Principal Staff Assistants (PSAs), with inputs from all DoD Components," Hicks wrote. "The use of other data management and analytics platforms must be approved by the DoD CDO and appropriate Component CDO to ensure adherence to an open data standard architecture."

Researchers have suggested that unified data analytics platforms that tie disparate systems together, like Advana, could help lessen the workloads of Pentagon employees. Moreover, Advana is the authoritative source for reporting Ukraine supplemental funding, and has been praised for its ability to keep track of weapons shipments but criticized for the added burden to soldiers.

"Having a mission-ready enterprise analytics infrastructure is critical to the Department's goal of leveraging data and AI from the boardroom to the battlefield," a senior defense official told Defense One.

In May, the Pentagon announced a new initiative, called the Open Data and Applications Government-owned Interoperable Repositories framework, designed to bring data analytics across the defense enterprise. The goal is to build on Hicks' 2021 data memo to create a multi-vendor analytics ecosystem.

"Advana is an important part of that data infrastructure layer and application environment. Over the past two years, the Advana platform scaled rapidly from initial capability based on pilot projects and prototypes to an enterprise-wide data and analytics environment the Department now uses to inform decision making at all levels," the official said, adding that the planned upgrades will help support Advana's rapid user growth.

"These changes will result in improvements, such as accelerating the onboarding of new use cases for enterprise customers and opening up the Advana data platform to third party software development at a larger scale than we enjoy today. Our work to upgrade the platform will impact back-end engineering and, for the most part, no Advana customers will experience a degradation of existing services."


EPAM’s Acquisition of Odysseus: Revolutionizing Life Sciences with AI & Analytics – TimesTech

EPAM Systems, a leading digital transformation services and product engineering company, announced its acquisition of Odysseus Data Services, a top health data analytics company. Odysseus will expand EPAM's ability to transform the life sciences value chain through advanced data analytics, data methods and artificial intelligence.

"We are pleased to have Odysseus join EPAM. With their strong capabilities in Real-World Evidence and Real-World Data, the glue between multiple segments of the life sciences value chain, our natural synergies make this an exciting time to add this to our portfolio to help our clients achieve better outcomes," said Greg Killian, Senior Vice President of Life Sciences at EPAM. "We see the next wave of innovation based on standardized data powering AI and GenAI to improve life sciences research, clinical studies and post-market surveillance. Based on the combined strengths of EPAM and Odysseus, we are well positioned to lead that innovation."

Headquartered in Cambridge, Massachusetts, Odysseus generates healthcare data insights and evidence for clients through skilled data science and analytics, software engineering and data management, and ontology and vocabulary management. The company's focus on a standardized and systematic approach to healthcare data analytics is the foundation for a better understanding of the inner workings of healthcare interventions in drug treatment, drug safety and efficacy, epidemiological research, provider support, quality measurements and cost reduction. Odysseus is an active member of the Observational Health Data Sciences and Informatics (OHDSI) collaborative and is intimately involved in the open standards and open science community through participation in research and development, including OMOP CDM, open source tools and methods.

"We're excited to join the EPAM family," said Gregory Klebanov, CEO of Odysseus. "With EPAM's strong foundation in AI, machine learning, data analytics and data management and cloud infrastructure combined with our healthcare data analytics and Real World Evidence expertise, we can address the whole life sciences value chain more comprehensively."


Improving Business Performance with Machine Learning | by Juan Jose Munoz | Jun, 2024 – Towards Data Science

Because we are using an unsupervised learning algorithm, there is not a widely available measure of accuracy. However, we can use domain knowledge to validate our groups.

Visually inspecting the groups, we can see some benchmarking groups have a mix of Economy and Luxury hotels, which doesn't make business sense as the demand for hotels is fundamentally different.

We can scroll through the data and note some of those differences, but can we come up with our own accuracy measure?

We want to create a function to measure the consistency of the recommended Benchmarking sets across each feature. One way of doing this is by calculating the variance in each feature for each set. For each cluster, we can compute an average of each feature variance, and we can then average each hotel cluster variance to get a total model score.

From our domain knowledge, we know that in order to set up a comparable benchmark set, we need to prioritize hotels in the same Brand, possibly the same market, and the same country, and if we use different markets or countries, then the market tier should be the same.

With that in mind, we want our measure to have a higher penalty for variance in those features. To do so, we will use a weighted average to calculate each benchmark set variance. We will also print the variance of the key features and secondary features separately.
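The weighted-variance measure described above could be sketched roughly as follows. This is not the author's code: the names FEATURE_WEIGHTS, set_score, and model_score, the weight values, the use of population variance, and the assumption that every feature is already numerically encoded are all illustrative choices.

```python
from statistics import pvariance

# Hypothetical weights: key features are penalised more than secondary ones.
FEATURE_WEIGHTS = {"brand": 3.0, "market": 3.0, "rooms": 1.0}

def set_score(hotels, weights):
    """Weighted average of the per-feature variances within one benchmark set."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        w * pvariance([hotel[feature] for hotel in hotels])
        for feature, w in weights.items()
    )
    return weighted_sum / total_weight

def model_score(benchmark_sets, weights):
    """Average the per-set scores into one number for the whole model."""
    scores = [set_score(s, weights) for s in benchmark_sets]
    return sum(scores) / len(scores)

# Two toy benchmark sets with numerically encoded features (made-up values).
consistent_set = [
    {"brand": 1, "market": 2, "rooms": 0.5},
    {"brand": 1, "market": 2, "rooms": 0.5},
]
mixed_set = [
    {"brand": 1, "market": 2, "rooms": 0.5},
    {"brand": 4, "market": 2, "rooms": 0.5},
]
print(set_score(consistent_set, FEATURE_WEIGHTS))
print(set_score(mixed_set, FEATURE_WEIGHTS))
print(model_score([consistent_set, mixed_set], FEATURE_WEIGHTS))
```

A lower score means a more homogeneous set, and because brand carries a larger weight, a set that mixes brands is penalised more than one that only mixes a secondary feature.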

To sum up, to create our accuracy measure, we need to:

To keep our code clean and track our experiments, let's also define a function to store the results of our experiments.

Now that we have a baseline, let's see if we can improve our model.

Up until now, we did not have to know what was going on under the hood when we ran this code:

To improve our model, we will need to understand the model parameters and how we can interact with them to get better benchmark sets.

Let's start by looking at the scikit-learn documentation and source code:

There are quite a few things going on here.

The NearestNeighbors class inherits from NeighborsBase, which is the base class for nearest neighbor estimators. This class handles the common functionalities required for nearest-neighbor searches, such as

The NearestNeighbors class also inherits from the KNeighborsMixin and RadiusNeighborsMixin classes. These mixin classes add specific neighbor-search functionalities to the NearestNeighbors class.

Based on our scenario, KNeighborsMixin provides the functionality we need.

We need to understand one key parameter before we can improve our model; this is the distance metric.

The documentation mentions that the NearestNeighbors algorithm uses the Minkowski distance by default and gives us a reference to the SciPy API.

In scipy.spatial.distance, we can see two mathematical representations of "Minkowski" distance:

$$\|u - v\|_p = \left( \sum_i |u_i - v_i|^p \right)^{1/p}$$

This formula calculates the p-th root of the sum of powered differences across all elements.

The second mathematical representation of Minkowski distance is:

$$\|u - v\|_p = \left( \sum_i w_i \left( |u_i - v_i|^p \right) \right)^{1/p}$$

This is very similar to the first one, but it introduces weights w_i on the differences, emphasizing or de-emphasizing specific dimensions. This is useful where certain features are more relevant than others. By default, the weights are set to None, which gives all features the same weight of 1.0.
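To make the two representations concrete, here is a minimal pure-Python sketch of the weighted Minkowski distance. It follows the formula as described in the SciPy documentation, with each weight multiplying the powered difference, but the function itself and the sample vectors are illustrative rather than taken from the article.

```python
def minkowski(u, v, p=2, w=None):
    """Minkowski distance between u and v with optional per-feature weights."""
    if w is None:
        w = [1.0] * len(u)  # unweighted case: every feature counts equally
    powered = sum(wi * abs(ui - vi) ** p for ui, vi, wi in zip(u, v, w))
    return powered ** (1.0 / p)

u = [0.0, 0.0]
v = [3.0, 4.0]

unweighted = minkowski(u, v)               # (9 + 16) ** 0.5
weighted = minkowski(u, v, w=[4.0, 1.0])   # (4 * 9 + 16) ** 0.5
print(unweighted, weighted)
```

Raising the weight on the first feature makes a difference along that dimension count for more, which is how domain knowledge about important features can be encoded.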

This is a great option for improving our model as it allows us to pass domain knowledge to our model and emphasize similarities that are most relevant to users.

If we look at the formulas, we see the parameter p. This parameter affects the "path" the algorithm takes to calculate the distance. By default, p=2, which represents the Euclidean distance.

You can think of the Euclidean distance as calculating the distance by drawing a straight line between 2 points. This is usually the shortest distance; however, it is not always the most desirable way of calculating the distance, especially in higher-dimensional spaces. For more information on why this is the case, there is this great paper online: https://bib.dbvis.de/uploadedFiles/155.pdf

Another common value for p is 1. This represents the Manhattan distance. You can think of it as the distance between two points measured along a grid-like path.

On the other hand, if we increase p towards infinity, we end up with the Chebyshev distance, defined as the maximum absolute difference between any corresponding elements of the vectors. It essentially measures the worst-case difference, making it useful in scenarios where you want to ensure that no single feature varies too much.
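The effect of p can be checked numerically. The sketch below, using made-up vectors, computes the unweighted Minkowski distance for p=1 and p=2 and shows that a large p gets very close to the Chebyshev distance; the helper names are illustrative.

```python
def minkowski(u, v, p):
    """Unweighted Minkowski distance of order p."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def chebyshev(u, v):
    """Maximum absolute difference across corresponding elements."""
    return max(abs(a - b) for a, b in zip(u, v))

u = [1.0, 5.0, 2.0]
v = [4.0, 1.0, 2.0]   # absolute differences: 3, 4, 0

manhattan = minkowski(u, v, 1)    # 3 + 4 + 0
euclidean = minkowski(u, v, 2)    # (9 + 16) ** 0.5
large_p = minkowski(u, v, 50)     # dominated by the largest difference
worst_case = chebyshev(u, v)      # largest single difference
print(manhattan, euclidean, large_p, worst_case)
```

As p grows, the largest single difference dominates the sum, which is why the result converges on the worst-case feature gap.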

By reading and getting familiar with the documentation, we have uncovered a few possible options to improve our model.

By default, n_neighbors is 5. However, for our benchmark set, we want to compare each hotel to the 3 most similar hotels. To do so, we need to set n_neighbors = 4 (subject hotel + 3 peers).
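The reason for asking for four neighbors is that, when the query point is itself part of the indexed data, its nearest neighbor is itself at distance zero. The brute-force sketch below illustrates this with hypothetical two-feature hotel vectors; it is a stand-in for the library search, not the article's code.

```python
import math

def k_nearest(index, points, k):
    """Return indices of the k points closest to points[index] (Euclidean)."""
    distances = [
        (math.dist(points[index], p), i) for i, p in enumerate(points)
    ]
    return [i for _, i in sorted(distances)[:k]]

# Hypothetical scaled feature vectors, one per hotel.
hotels = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.3, 0.3), (5.0, 5.0)]

neighbours = k_nearest(0, hotels, k=4)  # subject hotel + 3 peers
peers = neighbours[1:]                  # drop the subject hotel itself
print(neighbours, peers)
```

Dropping the first result leaves exactly the three peers that make up the benchmark set.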

Based on the documentation, we can pass weights to the distance calculation to emphasize the relationship across some features. Based on our domain knowledge, we have identified the features we want to emphasize, in this case, Brand, Market, Country, and Market Tier.

Passing domain knowledge to the model via weights increased the score significantly. Next, let's test the impact of the distance measure.

So far, we have been using the Euclidean distance. Let's see what happens if we use the Manhattan distance instead.

Decreasing p to 1 resulted in some good improvements. Let's see what happens as p approaches infinity.

To use the Chebyshev distance, we will change the metric parameter to Chebyshev. The default sklearn Chebyshev metric doesn't have a weight parameter. To get around this, we will define a custom weighted_chebyshev metric.
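The article's own weighted_chebyshev definition is not reproduced here, so the following is only one plausible reading: scale each absolute difference by its weight and take the maximum. The weight values and the use of functools.partial to bind them are assumptions.

```python
from functools import partial

def weighted_chebyshev(u, v, w):
    """Largest weighted absolute difference between corresponding elements."""
    return max(wi * abs(ui - vi) for ui, vi, wi in zip(u, v, w))

# Hypothetical weights: the first feature matters three times as much.
weights = [3.0, 1.0]

# Bind the weights so the metric takes just two vectors.
metric = partial(weighted_chebyshev, w=weights)

distance = metric([0.0, 0.0], [1.0, 2.0])  # max(3 * 1, 1 * 2)
print(distance)
```

scikit-learn's NearestNeighbors accepts a callable as its metric argument, so a two-argument function like this could in principle be passed in, at the cost of slower searches than the built-in metrics.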

We managed to decrease the primary feature variance scores through experimentation.

Let's visualize the results.

Using Manhattan distance with weights seems to give the most accurate benchmark sets according to our needs.

The last step before implementing the benchmark sets would be to examine the sets with the highest Primary features scores and identify what steps to take with them.

These 18 cases will need to be reviewed to ensure the benchmark sets are relevant.

As you can see, with a few lines of code and some understanding of Nearest neighbor search, we managed to set internal benchmark sets. We can now distribute the sets and start measuring hotels' KPIs against their benchmark sets.

You don't always have to focus on the most cutting-edge machine learning methods to deliver value. Very often, simple machine learning can deliver great value.

What are some low-hanging fruits in your business that you could easily tackle with Machine learning?

World Bank. World Development Indicators. Retrieved June 11, 2024, from https://datacatalog.worldbank.org/search/dataset/0038117

Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (n.d.). On the Surprising Behavior of Distance Metrics in High Dimensional Space. IBM T. J. Watson Research Center and Institute of Computer Science, University of Halle. Retrieved from https://bib.dbvis.de/uploadedFiles/155.pdf

SciPy v1.10.1 Manual. scipy.spatial.distance.minkowski. Retrieved June 11, 2024, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.minkowski.html

GeeksforGeeks. Haversine formula to find distance between two points on a sphere. Retrieved June 11, 2024, from https://www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/

scikit-learn. Neighbors Module. Retrieved June 11, 2024, from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors


Multilingual RAG, Algorithmic Thinking, Outlier Detection, and Other Problem-Solving Highlights – Towards Data Science


Feeling inspired to write your first TDS post? We're always open to contributions from new authors.

When we think about problem-solving, our focus tends to be on the solving part: the powerful hack, a new magical tool, a few lines of code that make everything click into place. In reality, a lot has to happen for these final touches to work, from developing a solid understanding of what the problem actually is, to sketching out a workable process that ensures we find consistent success rather than just a temporary band-aid.

Our weekly highlights this week stand out for their holistic approach to finding effective solutions to occasionally thorny challenges. They offer a glimpse into practitioners' mindset as they explore their available resources (data, tools, and time, to name a few) and weigh the pros and cons of different workflows. We think they might just inspire you to view whatever project you're working on at the moment from a new perspective. Enjoy your reading!


Computers and chemistry | UDaily – University of Delaware

Soham Jariwala has a unique perspective: He took the Hackathon course in 2022 and this spring is a project mentor alongside Vasu Venkateshwaran from W. L. Gore & Associates. Jariwala, a doctoral alumnus from the chemical and biomolecular engineering department and now a modeling and simulation scientist at Gore, was not officially part of the NRT program but took the course to gain hands-on experience with using machine learning tools for industry projects, an experience that he said was formative in helping him succeed in his current role.

"In a traditional classroom, you have a limited perspective on how projects are conducted in industry," Jariwala said. "In the Hackathon class, you have a problem that even the industry experts don't know the answer to. As a team, you bring your own expertise, brainstorm ideas, find the best approach, and learn about other areas in order to reach a decision."

The summer after completing the Hackathon course, students in the NRT-MIDAS program have the option of either completing a summer internship or a two-week teaching workshop.

Alison Shapiro, a chemical engineering doctoral candidate working in the lab of Allan & Myra Ferguson Distinguished Professor Thomas H. Epps, III, worked at Dow last summer as part of the company's cable and wire department. During her time at Dow, she looked for ways to recycle and more sustainably fabricate the insulating, protective polymers that coat electrical wires and conducted life cycle assessments for candidate materials.

Shapiro said that understanding the vernacular used to talk about chemicals during the Hackathon course was extremely helpful in completing her internship projects.

"At Dow, they had the same way of talking about [formulations] as we did in the Hackathon class, which is different from how most academic research was done," she said. "That was one of the things that translated over the most, and I initially had no idea that it was going to be so helpful."

Sean Farrington, who is also part of the first NRT cohort, is a doctoral candidate working under Unidel Robert L. Pigford Chair Norman Wagner and Arthur B. Metzner Professor Antony Beris. Last summer he completed a teaching workshop developed by NRT core faculty member and associate professor Joshua Enszer, which involved presentations about class preparation and teaching strategies followed by each student delivering a mock lecture and receiving constructive feedback.

Farrington, who is currently a TA in the chemical engineering department, regularly uses a list of action verbs provided by Enszer when preparing to teach. He said that the workshop was invaluable, not only for his current career plans of working in academia but because "no matter what job you have, you are always going to have to teach people something, and to do so you need to figure out exactly what your learning outcomes are."

Outside of the program's coursework and professional development activities, NRT-MIDAS also fosters a strong sense of community that reaches across multiple departments.

This includes a biweekly NRT community hour organized by Johnston and Jayaraman, where all members of the NRT-MIDAS community get together during the lunch hour. Along with socializing over pizza, students get to hear invited speakers discuss their research in academic and national laboratories, learn about various STEM careers in industry, publishing, and teaching, and attend professional development workshops on topics such as data ethics, responsible conduct of research, and science communication.

As NRT program coordinator, Johnston plays a key role in helping foster this sense of community, from helping students become comfortable with public speaking and communication during their outreach activities to hosting monthly individual advising meetings with all of the trainees, which Johnston said is the highlight of her week.

"Not only do I enjoy getting to know our students, but they also provide valuable information on what they need from us as a program," Johnston said. "These meetings have helped influence our professional skill community hour topics and have given us the ability to really cater to the needs of the students in our program."

During the summer, students work in teams to complete an outreach activity that showcases STEM research and data science concepts for a variety of non-scientific audiences.

In the summer of 2022, the first cohort created videos to help explain their research and pique other students' interest in science.

Read more from the original source:

Computers and chemistry | UDaily - University of Delaware

Qlik meets user needs with realistic approach to generative AI – TechTarget

While some vendors fed the hype, Qlik took a pragmatic approach to generative AI.

The business analytics vendor considered its existing capabilities when generative AI surged in popularity following OpenAI's release of ChatGPT in late 2022. It considered its customers' needs. And it considered what it needed to add to meet those customers' needs.

From there, Qlik came up with a realistic strategy that has trustworthy data at its core, and one that its users believe in.

"I always check on what other vendors are doing," said Angel Monjars, Qlik platform manager at C40 Cities, a network of nearly 100 cities working together to combat climate change. "I have to stay in touch with everything that's out there. I'm confident that Qlik is on the right track."

At this point, generative AI is nothing new. But after the launch of ChatGPT, it suddenly embodied the technology that could finally enable true natural language processing. It was the technology that, when combined with enterprise data, could reduce and eliminate coding and enable anyone within an organization to work with data rather than a small percentage of specialists.

Within months, data management and analytics vendors such as Microsoft, ThoughtSpot, Tableau, Alteryx and Informatica were among the many to unveil plans to augment their platforms with generative AI, introducing tools that would make their platforms smarter and simpler. But the tools were under development rather than nearing general availability.

Some of those tools eventually made their way through the preview process. For example, Microsoft first unveiled its Copilot for Power BI in May 2023, but didn't make it generally available until one year later. Other generative AI systems, however, after more than a year, are still not generally available.

Qlik, conversely, didn't quickly grab attention when generative AI became the rage. It didn't publicize every time it came up with an idea. It didn't introduce tools in development that promised to eliminate the difficulties that have existed for decades that make data management and analytics specialized skills.

It didn't buy into the hype surrounding generative AI.

"Over the last year, there were a whole heck of a lot of people making a lot of noise about [generative AI]. We did a bit of the opposite," said Nick Magnuson, Qlik's head of AI. "We took a step back and asked some key questions about how we wanted to plan an ecosystem."

Qlik might have lost out on some publicity in the process. But according to Susan Dean, director of business technology at heavy equipment manufacturer Takeuchi U.S., the ecosystem Qlik is developing serves customers' needs. And what it is now revealing related to generative AI is accelerating quickly.

"Definitely," she said when asked whether Qlik is providing the necessary tools for generative AI development, in an interview earlier this week during this year's Qlik user conference. "I'm very excited to see what's next. They just keep getting better. The leaps and bounds from last year's Qlik [conference] to this year's is night and day."

In a sense, Qlik has always been pragmatic. It's part of why the vendor is still relevant 31 years after it was founded, while onetime competitors such as Business Objects, Cognos and Information Builders have been swallowed up by other vendors and essentially disappeared.

Based in King of Prussia, Pa., Qlik is a longtime analytics vendor that has evolved as business intelligence has evolved.

When data was kept on premises and analytics was a specialized skill for experts only, Qlik provided a platform to meet the needs of data analysts. When Tableau rose to prominence touting self-service analytics, Qlik adapted and developed self-service tools to meet the needs of business users.

When cloud computing emerged and enterprises migrated their data operations to the various clouds, Qlik complemented its on-premises capabilities with a cloud-based version of its platform. When that was no longer enough, Qlik identified data integration as an opportunity for growth and over the past six years has methodically built up a data integration suite to complement its analytics capabilities.

Now, the vendor is taking that same strategic approach to AI as it creates an environment for customers to develop AI models and applications and apply generative AI to existing data products.

"Despite what everyone else is doing, what matters most is our customer needs," Magnuson said. "That's where we're focused."

To meet customer need for a trusted foundation for generative AI, Qlik started by combining its existing AI and machine learning capabilities in a single environment it calls Staige.

Unveiled in September, Staige includes AutoML, which is a tool that enables users to perform predictive analysis, and Insight Advisor, a natural language interface that lets customers query and analyze structured data and provides natural language responses with accompanying visuals.

In addition, Staige provides automated code generation capabilities, integrations with generative AI capabilities from third-party providers such as OpenAI and Hugging Face, and an advisory council to provide guidance for customers getting started with AI.

While Qlik's existing capabilities combined with third-party integrations were a start, Qlik needed more capabilities to effectively provide a foundation for developing trusted AI models and applications.

One thing missing was support for unstructured data, such as text and audio files, which is estimated to now make up more than 80% of all data.

To add support for unstructured data, Qlik acquired Kyndi in January and on June 4 unveiled Qlik Answers. The tool, scheduled for general availability this summer, uses retrieval-augmented generation to enable customers to query and analyze unstructured data with natural language in the same way Insight Advisor enables natural language interactions with structured data.

Furthermore, Qlik Answers provides data lineage information so that users can trace the data used to inform the tool's responses and ensure that those responses can be trusted.

Also missing was the data management component -- the integration layer that would enable customers to build applications using quality data from the start rather than look back later to see if the data they already used could be trusted.

Therefore, to complement Answers, the vendor on June 4 unveiled Qlik Talend Cloud, which is likewise scheduled for general availability this summer. The suite, which comes a little more than a year after Qlik completed its acquisition of Talend, is a data integration environment that forms the foundation for ensuring the quality of data used to train generative AI models and applications. Included are governance capabilities and tools such as a trust score.

Combined, Qlik Answers and Qlik Talend Cloud succeed at providing quality data for AI models and applications, according to Mike Leone, an analyst at TechTarget's Enterprise Strategy Group.

"Qlik Answers and Qlik Talend Cloud can work together to deliver a trusted data foundation for AI and fuel innovation from AI," he said.

In addition, the acquisition of Kyndi was critical, Leone continued.

"Kyndi is really that enabling factor for Qlik to extend the delivery of predictive AI and generative AI more broadly and at scale," he said. "I like Qlik's focus on unstructured data as it's often overlooked and underutilized."

Given the foundation that's now been formed by addressing practical needs, customers can begin using Qlik as a foundation for developing generative AI capabilities.

"After we saw what Qlik presented, the possibilities [for using generative AI] are open now," Monjars said.

C40 has been using Insight Advisor and other AI and machine learning tools, but had previously been hesitant to add any generative AI capabilities given its strict data security and data compliance requirements, he continued.

"A very real component we saw is the ability to analyze unstructured data, and there's a lot of knowledge there," Monjars said.

By grounding its generative AI plans in reality rather than making promises it might not be able to keep, Qlik is serving the needs of its customers.

But that pragmatic approach might have come with a cost, according to Donald Farmer, founder and principal of TreeHive Strategy.

Data management rivals Databricks and Snowflake have broadcast seemingly every move while creating environments for AI development. Tech giants AWS, Google and Microsoft have similarly maintained a steady presence in the collective mindset. And many of the more specialized vendors have introduced large swaths of capabilities even when they're only just starting to build them.

Qlik's comparatively quiet approach might have resulted in slow customer growth.

Farmer spent nearly 20 years in product development, including a stint at Qlik as vice president of innovation and design. Now, he heads a consulting firm that works with companies to develop analytics and AI strategies. While the evidence is anecdotal, he noted that Qlik's resonance with potential new customers seems to be slowing.

"Qlik still remains a significant vendor, but with one caveat," Farmer said. "There is very little sign of them gaining traction with greenfield customers. The trickle of new logos is slow. Mostly, they are adding more value to existing clients. But to be fair, they are adding significant value."

Qlik Answers could be a means of adding new users, according to Magnuson.

When Qlik added automated machine learning capabilities with its 2021 acquisition of Big Squid and turned that technology into AutoML, it drew in new customers, he said. Once generally available, Qlik Answers, though tightly integrated with the rest of the Qlik ecosystem, will be available as a standalone tool and could likewise be a way to draw new customers.

"We've made a conscious decision as part of a strategy to offer these solutions to a new buying agenda," Magnuson said. "We know a lot of people are generating new budgets to acquire technology. ... Answers potentially gives us a new opportunity to have a conversation with someone where we can open up a net new opportunity."

Regardless of whether Qlik's practical approach to generative AI brings in a significant number of new customers, what Qlik is doing in terms of technological innovation and support for that technology works for the vendor's existing users, according to Dean.

When Dean joined Takeuchi U.S. in 2018, the company had one analyst keeping its data in Excel spreadsheets. Dean subsequently led the company's transition to Qlik, beginning with a single application. Now, Takeuchi U.S. uses Qlik not only in its administrative office, but also with each of its hundreds of dealers.

But while Takeuchi U.S. -- a subsidiary of Japan-based Takeuchi Manufacturing -- is a sizable organization, it does not boast a big roster of data scientists. Dean is part of a team of three BI analysts.

To do more advanced analytics than just developing dashboards and reports, Takeuchi U.S. needs assistance. One of the main reasons the company has remained with Qlik is the relationship Dean and her team have with the vendor and the support they receive.

"My partnership with Qlik is what keeps me," Dean said. "They work with us."

Takeuchi U.S. now uses AutoML. And once a major undertaking to implement an ERP system is finished next spring, the company wants to build new analytics applications to discover insights related to the performance of its excavators, wheel loaders and other products.

"I'll definitely set up demos with [Qlik] to figure out what will suit us when the time is right," Dean said.

While Qlik Answers and Qlik Talend Cloud in some ways complete the foundation for trusted data that Qlik targeted as its role in enabling generative AI development, the vendor nevertheless plans to develop additional capabilities.

Most notably, it aims to enable customers to query and analyze structured and unstructured data together, according to Magnuson. The acquisition of Kyndi led to Qlik Answers, which enables customers to operationalize unstructured data. But that's just a beginning.

"[Qlik Answers] is starting us on this bigger journey to develop strength and muscle around unstructured content that puts us in a position to provide value to customers by integrating both structured and unstructured data in a single analytics experience," Magnuson said.

Monjars likewise noted that Qlik's enablement of access to unstructured data is significant. From a technological standpoint, Qlik is meeting C40's needs. But where he said he'd like to see more investment from Qlik is in another practical area: increasing awareness.

Qlik provides its own data literacy program. But its customer base is not as big as those of some other platforms such as Power BI, so it is sometimes difficult to find new employees who don't need to be trained to use Qlik, Monjars noted.

"Qlik is doing what we need, but it's a little hard to find people who are Qlik-trained," he said. "A given professional maybe learns Power BI before they learn Qlik, so that affects the availability of people out there. It would be helpful if Qlik were more of a household name and people made it a priority to learn Qlik coming out of school."

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.

More:

Qlik meets user needs with realistic approach to generative AI - TechTarget

Exploring RAG Applications Across Languages: Conversing with the Mishnah – Towards Data Science


I'm excited to share my journey of building a unique Retrieval-Augmented Generation (RAG) application for interacting with rabbinic texts in this post. MishnahBot aims to provide scholars and everyday users with an intuitive way to query and explore the Mishnah interactively. It can help solve problems such as quickly locating relevant source texts or summarizing a complex debate about religious law, extracting the bottom line.

I had the idea for such a project a few years back, but I felt like the technology wasn't ripe yet. Now, with advancements in large language models and RAG capabilities, it is pretty straightforward.

This is what our final product will look like, which you could try out here:

RAG applications are gaining significant attention for improving accuracy and harnessing the reasoning power available in large language models (LLMs). Imagine being able to chat with your library, a collection of car manuals from the same manufacturer, or your tax documents. You can ask questions and receive answers informed by the wealth of specialized knowledge.

There are two emerging trends in improving language model interactions: Retrieval-Augmented Generation (RAG) and increasing context length, potentially by allowing very long documents as attachments.

One key advantage of RAG systems is cost-efficiency. With RAG, you can handle large contexts without drastically increasing the query cost, which can become expensive. Additionally, RAG is more modular, allowing you to plug and play with different knowledge bases and LLM providers. On the other hand, increasing the context length directly in language models is an exciting development that can enable handling much longer texts in a single interaction.

For this project, I used AWS SageMaker for my development environment, AWS Bedrock to access various LLMs, and the LangChain framework to manage the pipeline. Both AWS services are user-friendly and charge only for the resources used, so I really encourage you to try it out yourselves. For Bedrock, you'll need to request access to Llama 3 70b Instruct and Claude Sonnet.

Let's open a new Jupyter notebook and install the packages we will be using:

The dataset for this project is the Mishnah, an ancient Rabbinic text central to Jewish tradition. I chose this text because it is close to my heart and also presents a challenge for language models since it is a niche topic. The dataset was obtained from the Sefaria-Export repository, a treasure trove of rabbinic texts with English translations aligned with the original Hebrew. This alignment facilitates switching between languages in different steps of our RAG application.

Note: The same process applied here can be applied to any other collection of texts of your choosing. This example also demonstrates how RAG technology can be utilized across different languages, as shown with Hebrew in this case.

First we will need to download the relevant data. We will use git sparse-checkout since the full repository is quite large. Open the terminal window and run the following.

And voilà! We now have the data files that we need:

Now let's load the documents in our Jupyter notebook environment:

And take a look at the data:

Looks good, we can move on to the vector database stage.

Next, we vectorize the text and store it in a local ChromaDB. In one sentence, the idea is to represent text as dense vectors (arrays of numbers) such that texts that are semantically similar will be close to each other in vector space. This is the technology that will enable us to retrieve the relevant passages given a query.
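To make "close to each other in vector space" concrete, here is a minimal Python sketch using cosine similarity on made-up three-dimensional vectors. The vectors, the `cosine_similarity` and `top_k` helpers, and the choice of cosine as the distance measure are illustrative assumptions, not the embedding model or the ChromaDB configuration used in this project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Made-up toy "embeddings"; a real embedding model produces vectors
# with hundreds of dimensions.
docs = [
    [0.9, 0.1, 0.0],  # passage 0
    [0.0, 1.0, 0.2],  # passage 1
    [0.8, 0.3, 0.1],  # passage 2
]
query = [1.0, 0.2, 0.0]

print(top_k(query, docs))  # indices of the two passages closest to the query
```

A vector database performs the same kind of nearest-neighbor lookup, only with an index that avoids comparing the query against every stored vector.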

We opted for a lightweight vectorization model, the all-MiniLM-L6-v2, which can run efficiently on a CPU. This model provides a good balance between performance and resource efficiency, making it suitable for our application. While state-of-the-art models like OpenAI's text-embedding-3-large may offer superior performance, they require substantial computational resources, typically running on GPUs.

For more information about embedding models and their performance, you can refer to the MTEB leaderboard which compares various text embedding models on multiple tasks.

Here's the code we will use for vectorizing (should only take a few minutes to run on this dataset on a CPU machine):

With our dataset ready, we can now create our Retrieval-Augmented Generation (RAG) application in English. For this, we'll use LangChain, a powerful framework that provides a unified interface for various language model operations and integrations, making it easy to build sophisticated applications.

LangChain simplifies the process of integrating different components like language models (LLMs), retrievers, and vector stores. By using LangChain, we can focus on the high-level logic of our application without worrying about the underlying complexities of each component.

Here's the code to set up our RAG system:

Alright! Let's try it out! We will use a query related to the very first paragraphs in the Mishnah.

That seems pretty accurate.

Let's try a more sophisticated question:

Very nice.

I tried that out; here's what I got:

The response is long and not to the point, and the answer that is given is incorrect (reaping is the third type of work in the list, while selecting is the seventh). This is what we call a hallucination.

While Claude is a powerful language model, relying solely on an LLM for generating responses from memorized training data or even using internet searches lacks the precision and control offered by a custom database in a Retrieval-Augmented Generation (RAG) application. Here's why:

This structured retrieval process ensures users receive the most accurate and relevant answers, leveraging both the language generation capabilities of LLMs and the precision of custom data retrieval.

Finally, we will address the challenge of interacting in Hebrew with the original Hebrew text. The same approach can be applied to any other language, as long as you are able to translate the texts to English for the retrieval stage.

Supporting Hebrew interactions adds an extra layer of complexity since embedding models and large language models (LLMs) tend to be stronger in English. While some embedding models and LLMs do support Hebrew, they are often less robust than their English counterparts, especially the smaller embedding models that likely focused more on English during training.

To tackle this, we could train our own Hebrew embedding model. However, another practical approach is to leverage a one-time translation of the text to English and use English embeddings for the retrieval process. This way, we benefit from the strong performance of English models while still supporting Hebrew interactions.
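The core of this cross-lingual idea can be sketched in a few lines of Python: rank passages using their English text, but hand back the aligned Hebrew text. Everything here is a hypothetical stand-in: the `passages` records and their placeholder contents, the `retrieve_hebrew` helper, and especially `overlap_score`, a toy word-overlap measure used in place of real embedding similarity. The query is assumed to have already been translated to English.

```python
def overlap_score(query, text):
    """Toy relevance score: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_hebrew(query_en, passages, k=1):
    """Rank passages by their English text, but return the aligned Hebrew text."""
    ranked = sorted(
        passages,
        key=lambda p: overlap_score(query_en, p["en"]),
        reverse=True,
    )
    return [p["he"] for p in ranked[:k]]

# Placeholder records: "en" holds a made-up English summary and "he" stands in
# for the aligned Hebrew source text of the same passage.
passages = [
    {"en": "rules about evening prayer times", "he": "<Hebrew source of passage 0>"},
    {"en": "rules about work on the sabbath", "he": "<Hebrew source of passage 1>"},
]

print(retrieve_hebrew("which work is forbidden on the sabbath", passages))
```

Because the English and Hebrew texts are aligned passage by passage, the retrieval step never needs a Hebrew embedding model; only the generation step has to handle Hebrew.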

In our case, we already have professional human translations of the Mishnah text into English. We will use this to ensure accurate retrievals while maintaining the integrity of the Hebrew responses. Here's how we can set up this cross-lingual RAG system:

For generation, we use Claude Sonnet since it performs significantly better on Hebrew text compared to Llama 3.

Here is the code implementation:

Let's try it! We will use the same question as before, but in Hebrew this time:

We got an accurate, one-word answer to our question. Pretty neat, right?

The translation with Llama 3 Instruct posed several challenges. Initially, the model produced nonsensical results no matter what I tried. (Apparently, Llama 3 Instruct is very sensitive to prompts starting with a newline character!)

After resolving that issue, the model tended to output the correct response, but then continue with additional irrelevant text, so stopping the output at a newline character proved effective.
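Both workarounds amount to simple string handling around the model call. The following Python sketch shows one way to do it; the helper names `clean_prompt` and `first_line` are hypothetical, and the model call itself is left out, so this is an illustration of the idea rather than the code used in this project.

```python
def clean_prompt(prompt):
    """Remove leading whitespace, including any newline at the start of the prompt."""
    return prompt.lstrip()

def first_line(completion):
    """Keep only the first line of the model output and discard anything after it."""
    return completion.strip().split("\n", 1)[0]

# Made-up strings standing in for a real prompt and a real model completion.
prompt = clean_prompt("\nTranslate to English: ...")
raw_output = "What is the first type of work?\nSome unrelated continuation..."

print(repr(prompt))
print(first_line(raw_output))
```

Where the model API offers a stop-sequence setting, passing the newline character there achieves the same truncation without post-processing.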

Controlling the output format can be tricky. Some strategies include requesting a JSON format or providing examples with few-shot prompts.

In this project, we also remove vowels from the Hebrew texts since most Hebrew text online does not include vowels, and we want the context for our LLM to be similar to text seen during pretraining.
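One way to strip Hebrew vowel points in Python is to drop every character in the Unicode category "Mn" (nonspacing mark), which covers the Hebrew points and cantillation marks. This is a sketch under that assumption; the `strip_niqqud` name is hypothetical, and the project's own implementation may differ.

```python
import unicodedata

def strip_niqqud(text):
    """Drop nonspacing combining marks (category Mn), keeping the base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# The word "shalom" written with vowel points, given as escapes for clarity:
# shin, shin dot, qamats, lamed, vav, holam, final mem.
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
plain = strip_niqqud(pointed)

# Only the four base letters remain: shin, lamed, vav, final mem.
print(plain == "\u05e9\u05dc\u05d5\u05dd")
```

Note that this filter also removes combining marks from any non-Hebrew text in the input (for example, decomposed accents), which is acceptable only if the corpus is known to be Hebrew.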

Building this RAG application has been a fascinating journey, blending the nuances of ancient texts with modern AI technologies. My passion for making the library of ancient rabbinic texts more accessible to everyone (myself included) has driven this project. This technology enables chatting with your library, searching for sources based on ideas, and much more. The approach used here can be applied to other treasured collections of texts, opening up new possibilities for accessing and exploring historical and cultural knowledge.

It's amazing to see how all this can be accomplished in just a few hours, thanks to the powerful tools and frameworks available today. Feel free to check out the full code on GitHub, and play with the MishnahBot website.

Please share your comments and questions, especially if you're trying out something similar. If you want to see more content like this in the future, do let me know!

Follow this link:

Exploring RAG Applications Across Languages: Conversing with the Mishnah - Towards Data Science

Principal Foundation and EVERFI from Blackbaud Reach 26,000 U.S. Students with Growing National Data Science … – PR Newswire

DataSetGo, a first-of-its-kind digital curriculum, is opening doors to data science careers at more than 400 schools, with $50,000 in recent awards to those who show promise in the field

DES MOINES, Iowa, June 4, 2024 /PRNewswire/ -- Principal Foundation, a global nonprofit organization committed to helping people and communities build financially secure futures, and EVERFI from Blackbaud, the leader in powering social impact through education, announce the second and biggest year of DataSetGo, a first-of-its-kind interactive digital curriculum that teaches high school students the fundamentals of data science and its value in daily life, the workforce, and the world.

Since its inception in 2022, DataSetGo has reached over 26,000 high school students in over 400 schools throughout the U.S. In the 2023-2024 academic year, the program reached over 17,000 new students and 200 additional schools across ten states, including New York, Texas, and California.

Last fall, DataSetGo expanded to include DataSetGo Distinguished Scholars, a new national program that equips students to explore postsecondary education and workforce opportunities, including those in the rapidly growing field of data science.

Data science roles can be found in nearly every industry, and according to the World Economic Forum, there could be up to 1.4 million new jobs created in data science and data analytics between 2023 and 2027.

"Having learned more about the possible job opportunities has opened several possibilities I had no idea existed. It would be a dream to learn more about data analysis and science to eventually make a profession out of it one day," wrote Jarod Story, a Distinguished Scholar who attends Irving High School near Dallas, Texas.

All the schools that use the DataSetGo curriculum are in low- to moderate-income communities, where educators have given the program high marks. The research-backed curriculum was designed to align with national educational standards and is provided at no cost to educators through a strategic partnership between Principal Foundation and EVERFI.

"This program [DataSetGo] is totally awesome. I'm so overwhelmingly proud of my students," said LaTara Meyers, a teacher at H.D. Woodson High School in Washington, D.C., whose student Amaya Bostic is among the Distinguished Scholars.

The full list of ten Distinguished Scholars was announced in May. Each student received a $5,000 award, for a total of $50,000.

"These impressive students seized the opportunity to learn about data science and the doors it could open for them," said Jo Christine Miles, Director, Principal Foundation and Community Relations, Principal. "We're thrilled to provide awards that will help them continue to pursue their dreams."

"In nearly every industry, data science skills are in high demand. DataSetGo ensures that students are aware of the opportunities and equipped to pursue them because then, their career options are endless," said Ray Martinez, co-founder and President of EVERFI from Blackbaud.

Six of the ten Distinguished Scholars ("national award winners") were selected from a national pool of essay submissions that detailed how students plan to apply what they learned through DataSetGo in their careers and lives.

The other four Scholars ("local award winners") were selected from schools in Brooklyn, N.Y.; Minneapolis, Minn.; Washington, D.C.; and Dallas, Texas who participated in DataSetGo virtual or in-person learning sessions. Three of these local award winners were selected throughout the school year.

The final local 2023-2024 Distinguished Scholar was announced at an in-person event in Brooklyn, New York on Tuesday, May 13. Hosted by EVERFI and Principal Foundation, the event celebrated the DataSetGo program and featured sessions with guest speakers who rely on data science in their careers in professional sports, artificial intelligence, and entertainment.

Below are the 2023-2024 DataSetGo Distinguished Scholars. The entry window for the 2024-2025 competition will open September 15.

National award winners:

Local award winners:

For more information about DataSetGo or the DataSetGo Distinguished Scholars award program, visit https://principal.everfi.com.

About Principal Foundation

Principal Financial Group Foundation, Inc. ("Principal Foundation") is a duly recognized 501(c)(3) entity focused on providing philanthropic support to programs that build financial security in the communities where Principal Financial Group, Inc. ("Principal") operates. While Principal Foundation receives funding from Principal, Principal Foundation is a distinct, independent, charitable entity. Principal Foundation does not practice any form of investment advisory services and is not authorized to do so. Established in 1987, Principal Foundation works with organizations that are helping to shape and support the journey to financial security by ensuring access to essential needs, fostering social and cultural connections, and promoting financial inclusion. 3609043-052024

About EVERFI from Blackbaud

EVERFI from Blackbaud (NASDAQ: BLKB) is an international technology company driving social impact through education to address the most challenging issues affecting society ranging from financial wellness to mental health to workplace conduct and other critical topics. Founded in 2008, EVERFI's Impact-as-a-Service solution and digital educational content have reached more than 45 million learners globally. In 2020, the company was recognized as one of the World's Most Innovative Companies by Fast Company and was featured on Fortune Magazine's Impact 20 List. The company was also named to the 2021 GSV EdTech 150, a list of the most transformative growth companies in digital learning. Blackbaud acquired EVERFI in December 2021. To learn more about EVERFI, please visit everfi.com or follow us on Facebook, Instagram, LinkedIn, or Twitter @EVERFI.

Blackbaud Forward-looking Statements

Except for historical information, all the statements, expectations, and assumptions contained in this news release are forward-looking statements that involve a number of risks and uncertainties, including statements regarding expected benefits of products and product features. Although Blackbaud attempts to be accurate in making these forward-looking statements, it is possible that future circumstances might differ from the assumptions on which such statements are based. In addition, other important factors that could cause results to differ materially include the following: general economic risks; uncertainty regarding increased business and renewals from existing customers; continued success in sales growth; management of integration of acquired companies and other risks associated with acquisitions; risks associated with successful implementation of multiple integrated software products; the ability to attract and retain key personnel; risks associated with management of growth; lengthy sales and implementation cycles, particularly in larger organizations; technological changes that make our products and services less competitive; and the other risk factors set forth from time to time in the SEC filings for Blackbaud, copies of which are available free of charge at the SEC's website at http://www.sec.gov or upon request from Blackbaud's investor relations department. All Blackbaud product names appearing herein are trademarks or registered trademarks of Blackbaud, Inc.

Media Contact: Zevenia Dennis

SOURCE Principal Foundation

Read this article:

Principal Foundation and EVERFI from Blackbaud Reach 26,000 U.S. Students with Growing National Data Science ... - PR Newswire

The One Billion Row Challenge in Julia | by Vikas Negi | Jun, 2024 – Towards Data Science

A recent release of Julia such as 1.10 is recommended. For those wanting to use a notebook, the repository shared above also contains a Pluto file, for which Pluto.jl needs to be installed. The input data file for the challenge is unique for everyone and needs to be generated using this Python script. Keep in mind that the file is about 15 GB in size.

Additionally, we will be running benchmarks using the BenchmarkTools.jl package. Note that this does not impact the challenge; it's only meant to collect proper statistics to measure and quantify the performance of the Julia code.

The structure of the input data file measurements.txt is as follows (only the first five lines are shown):

The file contains a billion lines (also known as rows or records). Each line has a station name followed by the ; separator and then the recorded temperature. The number of unique stations can be up to 10,000. This implies that the same station appears on multiple lines. We therefore need to collect all the temperatures for all distinct stations in the file, and then calculate the required statistics. Easy, right?

My first attempt was to simply parse the file one line at a time, and then collect the results in a dictionary where every station name is a key and the temperatures are added to a vector of Float64 to be used as the value mapped to the key. I expected this to be slow, but our aim here is to get a number for the baseline performance.
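A baseline along these lines could look roughly like the sketch below. It is not the code from the original post; the function name collect_temps is an assumption, and it accepts any IO object so that it can be tried on a small in-memory sample before pointing it at the full file.

```julia
# Baseline sketch: read line by line and store every temperature per station.
function collect_temps(io::IO)
    temps = Dict{String, Vector{Float64}}()
    for line in eachline(io)
        isempty(line) && continue                   # tolerate a trailing blank line
        station, value = split(line, ';')
        # Create the vector on first sight of a station, then append to it.
        push!(get!(() -> Float64[], temps, String(station)), parse(Float64, value))
    end
    return temps
end

# Convenience method for a file on disk (the file name is supplied by the caller).
collect_temps(fname::AbstractString) = open(collect_temps, fname)

# Small in-memory sample with made-up values.
sample = collect_temps(IOBuffer("A;1.0\nB;2.5\nA;-3.0\n"))
```

Storing every reading keeps the logic simple, but it also means memory use grows with the number of lines, which is one reason to expect this version to be slow.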

Once the dictionary is ready, we can calculate the necessary statistics:
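As a rough sketch of this step, assuming the dictionary maps each station name to a vector of Float64 as described above, the statistics could be reduced to a (min, mean, max) tuple per station. The function name calculate_stats is hypothetical.

```julia
# Sketch: reduce each station's temperature vector to a (min, mean, max) tuple.
function calculate_stats(temps::Dict{String, Vector{Float64}})
    stats = Dict{String, NTuple{3, Float64}}()
    for (station, values) in temps
        stats[station] = (minimum(values), sum(values) / length(values), maximum(values))
    end
    return stats
end

# Made-up values for illustration.
stats = calculate_stats(Dict("A" => [1.0, -3.0, 5.0]))
```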

The output of all the data processing needs to be displayed in a certain format. This is achieved by the following function:
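The exact format is not spelled out here, so the sketch below assumes the convention of the original One Billion Row Challenge: stations sorted alphabetically and printed as name=min/mean/max with one decimal place, all wrapped in braces. The function name format_stats is hypothetical.

```julia
using Printf

# Sketch (assumed format): "{A=min/mean/max, B=min/mean/max, ...}", sorted by station name.
function format_stats(stats::Dict{String, NTuple{3, Float64}})
    entries = String[]
    for name in sort!(collect(keys(stats)))
        lo, avg, hi = stats[name]
        push!(entries, @sprintf("%s=%.1f/%.1f/%.1f", name, lo, avg, hi))
    end
    return "{" * join(entries, ", ") * "}"
end

# Made-up values for illustration.
println(format_stats(Dict("B" => (1.0, 2.0, 3.0), "A" => (-1.5, 0.5, 2.0))))
```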

Since this implementation is expected to take a long time, we can run a simple test by timing the following with @time, only once:

Our poor man's implementation takes about 526 seconds, or roughly 9 minutes. It's definitely slow, but not that bad at all!

Instead of reading the input file one line at a time, we can try to split it into chunks, and then process all the chunks in parallel. Julia makes it quite easy to implement a parallel for loop. However, we need to take some precautions while doing so.

Before we get to the loop, we first need to figure out how to split the file into chunks. This can be achieved using memory mapping to read the file. Then we need to determine the start and end positions of each chunk. It's important to note that each line in the input data file ends with a new-line character, which has 0x0a as the byte representation. So each chunk should end at that character to ensure that we don't make any errors while parsing the file.

The following function takes the number of chunks num_chunks as an input argument, then returns an array with each element as the memory mapped chunk.
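One way such a function could be written is sketched below. It is not the original implementation: the name split_chunks is an assumption, and the core logic works on any byte vector so that it can be checked on a small sample. For the real file, the bytes come from Mmap.mmap.

```julia
using Mmap

# Sketch: split a byte buffer into at most `num_chunks` views, each ending on a newline (0x0a).
function split_chunks(data::AbstractVector{UInt8}, num_chunks::Integer)
    chunks = typeof(view(data, 1:0))[]
    n = length(data)
    approx = cld(n, num_chunks)                     # target chunk size in bytes
    start = 1
    while start <= n
        stop = min(start + approx - 1, n)
        # Extend the chunk to the next newline so that no line is cut in half.
        nl = findnext(==(0x0a), data, stop)
        stop = nl === nothing ? n : nl
        push!(chunks, view(data, start:stop))
        start = stop + 1
    end
    return chunks
end

# For a file on disk, memory map it first and split the resulting byte vector.
split_chunks(fname::AbstractString, num_chunks::Integer) =
    split_chunks(Mmap.mmap(fname), num_chunks)

# Small in-memory sample with made-up values.
chunks = split_chunks(Vector{UInt8}("A;1.0\nBB;2.0\nC;3.0\nD;4.0\n"), 2)
```

Because the chunks are views into the mapped bytes, splitting does not copy the file contents. The function may return fewer than num_chunks chunks when lines are long relative to the target size.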

Since we are parsing station and temperature data from different chunks, we also need to combine them in the end. Each chunk will first be processed into a dictionary as shown before. Then, we combine all chunks as follows:
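A minimal sketch of this merge step, assuming each chunk produced a dictionary from station name to a vector of Float64, is given below. The function name combine_chunks is hypothetical.

```julia
# Sketch: merge per-chunk dictionaries by concatenating the vectors of matching stations.
function combine_chunks(dicts::AbstractVector{<:AbstractDict})
    combined = Dict{String, Vector{Float64}}()
    for d in dicts
        for (station, values) in d
            append!(get!(() -> Float64[], combined, station), values)
        end
    end
    return combined
end

# Made-up values for illustration.
merged = combine_chunks([Dict("A" => [1.0]), Dict("A" => [2.0], "B" => [3.0])])
```

On Julia 1.5 or later, mergewith(vcat, dicts...) gives an equivalent result in one line, at the cost of allocating a new vector on every merge.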

Now we know how to split the file into chunks, and how we can combine the parsed dictionaries from the chunks at the end. However, the desired speedup can only be obtained if we are also able to process the chunks in parallel. This can be done in a for loop. Note that Julia should be started with multiple threads (for example, julia -t 12) for this solution to have any impact.
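The sketch below shows one way to write such a loop with Threads.@threads. The names process_chunk and process_parallel are assumptions, not the original code. The main precaution is that every iteration writes only to its own slot of a preallocated results vector, so no dictionary is shared between threads.

```julia
# Hypothetical per-chunk parser: turn one newline-terminated byte chunk into a dictionary.
function process_chunk(chunk::AbstractVector{UInt8})
    temps = Dict{String, Vector{Float64}}()
    for line in eachline(IOBuffer(chunk))
        isempty(line) && continue
        station, value = split(line, ';')
        push!(get!(() -> Float64[], temps, String(station)), parse(Float64, value))
    end
    return temps
end

# Sketch: process all chunks in parallel, one result slot per chunk.
function process_parallel(chunks)
    results = Vector{Dict{String, Vector{Float64}}}(undef, length(chunks))
    Threads.@threads for i in eachindex(chunks)
        results[i] = process_chunk(chunks[i])       # each iteration touches only results[i]
    end
    return results
end

# Made-up chunks for illustration.
results = process_parallel([Vector{UInt8}("A;1.0\nB;2.0\n"), Vector{UInt8}("A;3.0\n")])
```

With a single thread the loop still runs correctly, just without any speedup. Threads.nthreads() reports how many threads the session was started with.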

Additionally, we now want to run a proper statistical benchmark. This means that the challenge should be executed a certain number of times, and we should then be able to visualize the distribution of the results. Thankfully, all of this can be easily done with BenchmarkTools.jl. We cap the number of samples at 10, limit the total run time to 20 minutes, and enable garbage collection between samples to free up memory. All of this can be brought together in a single script. Note that the input arguments are now the name of the file fname and the number of chunks num_chunks.

Benchmark results along with the inputs used are shown below. Note that we have used 12 threads here.

Multi-threading provides a big performance boost: we are now down to a little over 2 minutes. Let's see what else we can improve.

Until now, our approach has been to store all the temperatures, and then determine the required statistics (min, mean and max) at the very end. However, the same can already be achieved while we parse every line from the input file. We replace existing values each time a new value which is either larger (for maximum) or smaller (for minimum) is found. For mean, we sum all the values and keep a separate counter as to how many times a temperature for a given station has been found.
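The idea can be sketched as follows, with each station mapped to a (min, max, sum, count) tuple that is updated for every reading. The alias StationStats and the function update! are hypothetical names, not taken from the original code.

```julia
# Running statistics per station: (min, max, sum, count).
const StationStats = Tuple{Float64, Float64, Float64, Int}

# Sketch: fold one temperature reading into the running statistics of its station.
function update!(stats::Dict{String, StationStats}, station::AbstractString, t::Float64)
    # For a station seen for the first time, start from (t, t, 0.0, 0).
    lo, hi, total, n = get(stats, station, (t, t, 0.0, 0))
    stats[station] = (min(lo, t), max(hi, t), total + t, n + 1)
    return stats
end

# Made-up readings for illustration.
stats = Dict{String, StationStats}()
for t in (1.0, -3.0, 5.0)
    update!(stats, "A", t)
end
lo, hi, total, n = stats["A"]
mean_temp = total / n                               # the mean is derived only at the end
```

Memory use now depends on the number of distinct stations rather than on the number of lines.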

Overall, our new logic looks like the following:

The function to combine all the results (from different chunks) also needs to be updated accordingly.
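A sketch of the updated merge step is shown below, assuming each chunk yields a dictionary from station name to a (min, max, sum, count) tuple. The function name combine_stats is an assumption. Minima and maxima are merged with min and max, while sums and counts are added, so the overall mean can still be computed as sum divided by count.

```julia
# Sketch: merge per-chunk running statistics stored as (min, max, sum, count) tuples.
function combine_stats(dicts)
    combined = Dict{String, Tuple{Float64, Float64, Float64, Int}}()
    for d in dicts
        for (station, (lo, hi, total, n)) in d
            if haskey(combined, station)
                clo, chi, ctotal, cn = combined[station]
                combined[station] = (min(clo, lo), max(chi, hi), ctotal + total, cn + n)
            else
                combined[station] = (lo, hi, total, n)
            end
        end
    end
    return combined
end

# Made-up per-chunk results for illustration.
merged = combine_stats([
    Dict("A" => (1.0, 2.0, 3.0, 2)),
    Dict("A" => (-1.0, 5.0, 4.0, 1), "B" => (0.0, 0.0, 0.0, 1)),
])
```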

Let's run a new benchmark and see if this change improves the timing.

The median time seems to have improved, but only slightly. It's a win, nonetheless!

Our previous logic to calculate and save the min and max temperatures can be further simplified. Moreover, following the suggestion from this Julia Discourse post, we can make use of views (using @view) when parsing the station names and temperature data. This has also been discussed in the Julia performance manual. Since we are using a slice expression for parsing every line, @view helps us avoid the cost of allocation and copying.
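As an illustration of the idea (not necessarily the exact code from the post), the sketch below parses a line with @view. On a String, a view over an index range returns a SubString that shares memory with the original line instead of copying it. The function name parse_line_view is hypothetical.

```julia
# Sketch: parse one line using views, so that the slices do not allocate new strings.
function parse_line_view(line::AbstractString)
    pos = findfirst(';', line)
    pos === nothing && throw(ArgumentError("missing ';' separator"))
    station = @view line[1:prevind(line, pos)]      # SubString, no copy
    temp = parse(Float64, @view line[pos+1:end])    # parse directly from the view
    return station, temp
end

# Made-up line for illustration.
station, temp = parse_line_view("Malmö;3.2")
```

The returned station is a SubString, so it should be converted with String(station) before being stored as a long-lived dictionary key. Otherwise it keeps the whole parent line alive.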

Rest of the logic remains the same. Running the benchmark now gives the following:

Whoa! We managed to reach down to almost a minute. It seems switching to a view does make a big difference. Perhaps, there are further tweaks that could be made to improve performance even further. In case you have any suggestions, do let me know in the comments.

Restricting ourselves only to base Julia was fun. However, in the real world, we will almost always be using packages and thus making use of existing efficient implementations for performing the relevant tasks. In our case, CSV.jl (parsing the file in parallel) and DataFrames.jl (performing groupby and combine) will come in handy.

The function below performs the following tasks:

We can now run the benchmark in the same manner as before.

The performance using CSV.jl and DataFrames.jl is quite good, albeit slower than our base Julia implementation. When working on real world projects, these packages are an essential part of a data scientist's toolkit. It would thus be interesting to explore if further optimizations are possible using this approach.
