Category Archives: Data Science
Get on Track for a Data Science Career With $15 Off This Bundle – PCMag
A career in data science can be lucrative, but it takes more than a few advanced math classes. Those who excel in this field might be expected to work with and even create the next generation's AI algorithms, so they need to know their way around software like Excel and Python. The Complete Excel, VBA and Data Science Certification Training Bundle is a great resume builder when it comes to big data, and it's now on sale for $34.97 for a limited time.
In this bundle, you'll find more than 50 hours of training on the software and code that data scientists use the most. That includes code-free platforms like Amazon Honeycode as well as Python and Excel, and beginners will find intro courses on them all that will get them ready for the headier concepts in later, more advanced lessons. At the end of each course, a certification from Mammoth Interactive will let you show your newfound knowledge to potential employers.
Get the Complete Excel, VBA and Data Science Certification Training Bundle for $34.97 (reg. $49.99) through Jan. 1.
Prices subject to change. PCMag editors select and review products independently. If you buy through StackSocial affiliate links, we may earn commissions, which help support our testing.
Read more:
Get on Track for a Data Science Career With $15 Off This Bundle - PCMag
Geospatial Indexing Explained: A Comparison of Geohash, S2, and H3 – Towards Data Science
Geospatial indexing, or Geocoding, is the process of indexing latitude-longitude pairs to small subdivisions of geographical space, and it is a technique that we data scientists often find ourselves using when faced with geospatial data.
Though the first popular geospatial indexing technique, Geohash, was invented as recently as 2008, indexing latitude-longitude pairs to manageable subdivisions of space is hardly a new concept. Governments have been breaking up their land into states, provinces, counties, and postal codes for centuries for all sorts of applications, such as taking censuses and aggregating votes for elections.
Rather than using the manual techniques used by governments, we data scientists use modern computational techniques to execute such spatial subdividing, and we do so for our own purposes: analytics, feature-engineering, granular AB testing by geographic subdivision, indexing geospatial databases, and more.
Geospatial indexing is a thoroughly developed area of computer science, and geospatial indexing tools can bring a lot of power and richness to our models and analyses. What makes geospatial indexing techniques even more exciting is that a look under their proverbial hoods reveals eclectic amalgams of other mathematical tools, such as space-filling curves, map projections, tessellations, and more!
This post will explore three of today's most popular geospatial indexing tools: where they come from, how they work, what makes them different from one another, and how you can get started using them. In chronological order, and from least to greatest complexity, we'll look at:
It will conclude by comparing these tools and recommending when you might want to use one over another.
Before getting started, note that these tools include much functionality beyond basic geospatial indexing: polygon intersection, polygon containment checks, line containment checks, generating cell-coverings of geographical spaces, retrieval of a geospatially indexed cell's neighbors, and more. This post, however, focuses strictly on geospatial indexing functionality.
Geohash, invented in 2008 by Gustavo Niemeyer, is the earliest created geospatial indexing tool [1]. It enables its users to map
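To make the idea concrete before moving on, below is a minimal, self-contained sketch of Geohash-style encoding in Python; the precision and the test coordinates are illustrative assumptions rather than anything taken from the article.
# A bare-bones Geohash encoder: interleave longitude/latitude bits, then base32-encode them.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=9):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even_bit = [], 0, 0, True
    while len(chars) < precision:
        rng, value = (lon_range, lon) if even_bit else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits = bits << 1
            rng[1] = mid
        even_bit = not even_bit
        bit_count += 1
        if bit_count == 5:                 # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744))  # 'u4pruydqq' (a commonly used test point)
Nearby points generally share a hash prefix, which is what makes this kind of index convenient for spatial grouping and lookups.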
Continue reading here:
Geospatial Indexing Explained: A Comparison of Geohash, S2, and H3 - Towards Data Science
Ted Kaouk, John Coughlan Take on Leadership Roles at CFTC Division of Data – Executive Gov
The Commodity Futures Trading Commission has expanded the leadership team of its Division of Data, or DOD, with the appointment of Ted Kaouk as chief data officer and director of DOD and John Coughlan as chief data scientist.
CFTC said Kaouk will oversee the DOD's data integration efforts and facilitate collaboration among offices and divisions to help the CFTC leadership make informed policy decisions.
He was chief data officer at the Office of Personnel Management and oversaw the development of OPM's inaugural human capital data strategy.
Kaouk also held the same position at the Department of Agriculture and served as the first chair of the Federal Chief Data Officers Council.
Meanwhile, Coughlan has held data science and analytical roles at CFTC for eight years. Before joining the DOD, he was a market analyst within the Division of Market Oversight's Market Intelligence Branch.
In his new role, Coughlan will help advance the DOD's data science capabilities and drive the adoption of artificial intelligence-powered tools across the agency.
Read more here:
Ted Kaouk, John Coughlan Take on Leadership Roles at CFTC Division of Data - Executive Gov
18 Data Science Tools to Consider Using in 2024 – TechTarget
The increasing volume and complexity of enterprise data, and its central role in decision-making and strategic planning, are driving organizations to invest in the people, processes and technologies they need to gain valuable business insights from their data assets. That includes a variety of tools commonly used in data science applications.
In an annual survey conducted by consulting firm Wavestone's NewVantage Partners unit, 87.8% of chief data officers and other IT and business executives from 116 large organizations said their investments in data and analytics initiatives, such as data science programs, increased during 2022. Looking ahead, 83.9% expect further increases this year despite the current economic conditions, according to a report on the Data and Analytics Leadership Executive Survey that was published in January 2023.
The survey also found that 91.9% of the responding organizations got measurable business value from their data and analytics investments in 2022 and that 98.2% expect their planned 2023 spending to pay off. Many strategic analytics goals remain aspirational, though: Only 40.8% of the respondents said they're competing on data and analytics, and just 23.9% have created a data-driven organization.
As data science teams build their portfolios of enabling technologies to help achieve those analytics goals, they can choose from a wide selection of tools and platforms. Here's a rundown of 18 top data science tools that may be able to aid you in the analytics process, listed in alphabetical order with details on their features and capabilities -- and some potential limitations.
Apache Spark is an open source data processing and analytics engine that can handle large amounts of data -- upward of several petabytes, according to proponents. Spark's ability to rapidly process data has fueled significant growth in the use of the platform since it was created in 2009, helping to make the Spark project one of the largest open source communities among big data technologies.
Due to its speed, Spark is well suited for continuous intelligence applications powered by near-real-time processing of streaming data. However, as a general-purpose distributed processing engine, Spark is equally suited for extract, transform and load uses and other SQL batch jobs. In fact, Spark initially was touted as a faster alternative to the MapReduce engine for batch processing in Hadoop clusters.
Spark is still often used with Hadoop but can also run standalone against other file systems and data stores. It features an extensive set of developer libraries and APIs, including a machine learning library and support for key programming languages, making it easier for data scientists to quickly put the platform to work.
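As a flavour of how that looks in practice, here is a minimal PySpark sketch; the file name and column names are illustrative assumptions, and a local installation of pyspark is assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-example").getOrCreate()

# Read a CSV into a distributed DataFrame and run a simple aggregation
# ('sales.csv', 'region' and 'amount' are hypothetical names).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
summary = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()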
Another open source tool, D3.js is a JavaScript library for creating custom data visualizations in a web browser. Commonly known as D3, which stands for Data-Driven Documents, it uses web standards, such as HTML, Scalable Vector Graphics and CSS, instead of its own graphical vocabulary. D3's developers describe it as a dynamic and flexible tool that requires a minimum amount of effort to generate visual representations of data.
D3.js lets visualization designers bind data to documents via the Document Object Model and then use DOM manipulation methods to make data-driven transformations to the documents. First released in 2011, it can be used to design various types of data visualizations and supports features such as interaction, animation, annotation and quantitative analysis.
D3 includes more than 30 modules and 1,000 visualization methods, making it complicated to learn. In addition, many data scientists don't have JavaScript skills. As a result, they may be more comfortable with a commercial visualization tool, like Tableau, leaving D3 to be used more by data visualization developers and specialists who are also members of data science teams.
IBM SPSS is a family of software for managing and analyzing complex statistical data. It includes two primary products: SPSS Statistics, a statistical analysis, data visualization and reporting tool, and SPSS Modeler, a data science and predictive analytics platform with a drag-and-drop UI and machine learning capabilities.
SPSS Statistics covers every step of the analytics process, from planning to model deployment, and enables users to clarify relationships between variables, create clusters of data points, identify trends and make predictions, among other capabilities. It can access common structured data types and offers a combination of a menu-driven UI, its own command syntax and the ability to integrate R and Python extensions, plus features for automating procedures and import-export ties to SPSS Modeler.
Created by SPSS Inc. in 1968, initially with the name Statistical Package for the Social Sciences, the statistical analysis software was acquired by IBM in 2009, along with the predictive modeling platform, which SPSS had previously bought. While the product family is officially called IBM SPSS, the software is still usually known simply as SPSS.
Julia is an open source programming language used for numerical computing, as well as machine learning and other kinds of data science applications. In a 2012 blog post announcing Julia, its four creators said they set out to design one language that addressed all of their needs. A big goal was to avoid having to write programs in one language and convert them to another for execution.
To that end, Julia combines the convenience of a high-level dynamic language with performance that's comparable to statically typed languages, such as C and Java. Users don't have to define data types in programs, but an option allows them to do so. The use of a multiple dispatch approach at runtime also helps to boost execution speed.
Julia 1.0 became available in 2018, nine years after work began on the language; the latest version is 1.9.4, with a 1.10 update now available for release candidate testing. The documentation for Julia notes that, because its compiler differs from the interpreters in data science languages like Python and R, new users "may find that Julia's performance is unintuitive at first." But, it claims, "once you understand how Julia works, it's easy to write code that's nearly as fast as C."
An open source web application, Jupyter Notebook enables interactive collaboration among data scientists, data engineers, mathematicians, researchers and other users. It's a computational notebook tool that can be used to create, edit and share code, as well as explanatory text, images and other information. For example, Jupyter users can add software code, computations, comments, data visualizations and rich media representations of computation results to a single document, known as a notebook, which can then be shared with and revised by colleagues.
As a result, notebooks "can serve as a complete computational record" of interactive sessions among the members of data science teams, according to Jupyter Notebook's documentation. The notebook documents are JSON files that have version control capabilities. In addition, a Notebook Viewer service enables them to be rendered as static webpages for viewing by users who don't have Jupyter installed on their systems.
Jupyter Notebook's roots are in the programming language Python -- it originally was part of the IPython interactive toolkit open source project before being split off in 2014. The loose combination of Julia, Python and R gave Jupyter its name; along with supporting those three languages, Jupyter has modular kernels for dozens of others. The open source project also includes JupyterLab, a newer web-based UI that's more flexible and extensible than the original one.
Keras is a programming interface that enables data scientists to more easily access and use the TensorFlow machine learning platform. It's an open source deep learning API and framework written in Python that runs on top of TensorFlow and is now integrated into that platform. Keras previously supported multiple back ends but was tied exclusively to TensorFlow starting with its 2.4.0 release in June 2020.
As a high-level API, Keras was designed to drive easy and fast experimentation that requires less coding than other deep learning options. The goal is to accelerate the implementation of machine learning models -- in particular, deep learning neural networks -- through a development process with "high iteration velocity," as the Keras documentation puts it.
The Keras framework includes a sequential interface for creating relatively simple linear stacks of layers with inputs and outputs, as well as a functional API for building more complex graphs of layers or writing deep learning models from scratch. Keras models can run on CPUs or GPUs and be deployed across multiple platforms, including web browsers and Android and iOS mobile devices.
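For illustration, here is a minimal sketch of the sequential interface described above; TensorFlow 2.x is assumed, and the layer sizes and random training data are purely illustrative.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A simple linear stack of layers for binary classification.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Fit on small random data purely to show the workflow.
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=(100,))
model.fit(X, y, epochs=3, batch_size=16, verbose=0)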
Developed and sold by software vendor MathWorks since 1984, Matlab is a high-level programming language and analytics environment for numerical computing, mathematical modeling and data visualization. It's primarily used by conventional engineers and scientists to analyze data, design algorithms and develop embedded systems for wireless communications, industrial control, signal processing and other applications, often in concert with a companion Simulink tool that offers model-based design and simulation capabilities.
While Matlab isn't as widely used in data science applications as languages like Python, R and Julia, it does support machine learning and deep learning, predictive modeling, big data analytics, computer vision and other work done by data scientists. Data types and high-level functions built into the platform are designed to speed up exploratory data analysis and data preparation in analytics applications.
Considered relatively easy to learn and use, Matlab -- which is short for matrix laboratory -- includes prebuilt applications but also enables users to build their own. It also has a library of add-on toolboxes with discipline-specific software and hundreds of built-in functions, including the ability to visualize data in 2D and 3D plots.
Matplotlib is an open source Python plotting library that's used to read, import and visualize data in analytics applications. Data scientists and other users can create static, animated and interactive data visualizations with Matplotlib, using it in Python scripts, the Python and IPython shells, Jupyter Notebook, web application servers and various GUI toolkits.
The library's large code base can be challenging to master, but it's organized in a hierarchical structure that's designed to enable users to build visualizations mostly with high-level commands. The top component in the hierarchy is pyplot, a module that provides a "state-machine environment" and a set of simple plotting functions similar to the ones in Matlab.
First released in 2003, Matplotlib also includes an object-oriented interface that can be used together with pyplot or on its own; it supports low-level commands for more complex data plotting. The library is primarily focused on creating 2D visualizations but offers an add-on toolkit with 3D plotting features.
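A minimal pyplot sketch of the state-machine interface mentioned above; the plotted data is an illustrative assumption.
import numpy as np
import matplotlib.pyplot as plt

# Plot a sine curve with labels and a legend using high-level pyplot commands.
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.show()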
Short for Numerical Python, NumPy is an open source Python library that's used widely in scientific computing, engineering, and data science and machine learning applications. The library consists of multidimensional array objects and routines for processing those arrays to enable various mathematical and logic functions. It also supports linear algebra, random number generation and other operations.
One of NumPy's core components is the N-dimensional array, or ndarray, which represents a collection of items that are the same type and size. An associated data-type object describes the format of the data elements in an array. The same data can be shared by multiple ndarrays, and data changes made in one can be viewed in another.
NumPy was created in 2006 by combining and modifying elements of two earlier libraries. The NumPy website touts it as "the universal standard for working with numerical data in Python," and it is generally considered one of the most useful libraries for Python because of its numerous built-in functions. It's also known for its speed, partly resulting from the use of optimized C code at its core. In addition, various other Python libraries are built on top of NumPy.
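A minimal sketch of the ndarray behaviour described above, including a shared view; the array values are illustrative.
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)  # a 2x3 ndarray with a single dtype
b = a[0]              # a view: shares the same underlying data as `a`
b[0] = 99.0           # a change made through the view is visible in `a`
print(a.dtype, a.shape)
print(a)
print(a.sum(axis=0))  # vectorised column sums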
Another popular open source Python library, pandas typically is used for data analysis and manipulation. Built on top of NumPy, it features two primary data structures: the Series one-dimensional array and the DataFrame, a two-dimensional structure for data manipulation with integrated indexing. Both can accept data from NumPy ndarrays and other inputs; a DataFrame can also incorporate multiple Series objects.
Created in 2008, pandas has built-in data visualization capabilities, exploratory data analysis functions and support for file formats and languages that include CSV, SQL, HTML and JSON. Additionally, it provides features such as intelligent data alignment, integrated handling of missing data, flexible reshaping and pivoting of data sets, data aggregation and transformation, and the ability to quickly merge and join data sets, according to the pandas website.
The developers of pandas say their goal is to make it "the fundamental high-level building block for doing practical, real-world data analysis in Python." Key code paths in pandas are written in C or the Cython superset of Python to optimize its performance, and the library can be used with various kinds of analytical and statistical data, including tabular, time series and labeled matrix data sets.
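A minimal sketch of the Series and DataFrame structures described above; the column names and values are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "city": ["Sydney", "Melbourne", "Sydney"],
    "sales": [120, 90, 150],
})

# Integrated indexing and aggregation on a two-dimensional DataFrame.
totals = df.groupby("city")["sales"].sum()  # a one-dimensional Series
print(totals)
print(df.describe())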
Python is the most widely used programming language for data science and machine learning and one of the most popular languages overall. The Python open source project's website describes it as "an interpreted, object-oriented, high-level programming language with dynamic semantics," as well as built-in data structures and dynamic typing and binding capabilities. The site also touts Python's simple syntax, saying it's easy to learn and its emphasis on readability reduces the cost of program maintenance.
The multipurpose language can be used for a wide range of tasks, including data analysis, data visualization, AI, natural language processing and robotic process automation. Developers can create web, mobile and desktop applications in Python, too. In addition to object-oriented programming, it supports procedural, functional and other types, plus extensions written in C or C++.
Python is used not only by data scientists, programmers and network engineers, but also by workers outside of computing disciplines, from accountants to mathematicians and scientists, who often are drawn to its user-friendly nature. Python 2.x and 3.x are both production-ready versions of the language, although support for the 2.x line ended in 2020.
An open source framework used to build and train deep learning models based on neural networks, PyTorch is touted by its proponents for supporting fast and flexible experimentation and a seamless transition to production deployment. The Python library was designed to be easier to use than Torch, a precursor machine learning framework that's based on the Lua programming language. PyTorch also provides more flexibility and speed than Torch, according to its creators.
First released publicly in 2017, PyTorch uses arraylike tensors to encode model inputs, outputs and parameters. Its tensors are similar to the multidimensional arrays supported by NumPy, but PyTorch adds built-in support for running models on GPUs. NumPy arrays can be converted into tensors for processing in PyTorch, and vice versa.
The library includes various functions and techniques, including an automatic differentiation package called torch.autograd and a module for building neural networks, plus a TorchServe tool for deploying PyTorch models and deployment support for iOS and Android devices. In addition to the primary Python API, PyTorch offers a C++ one that can be used as a separate front-end interface or to create extensions to Python applications.
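A minimal sketch of tensors, NumPy interoperability and autograd as described above; the shapes and values are illustrative.
import numpy as np
import torch

a = np.ones((2, 3), dtype=np.float32)
t = torch.from_numpy(a)                    # NumPy array -> tensor (shares memory on CPU)
x = torch.randn(2, 3, requires_grad=True)

y = (x * t).sum()                          # a simple computation
y.backward()                               # torch.autograd computes the gradients
print(x.grad)                              # d(y)/d(x), equal to t (all ones)
print(t.numpy())                           # tensor -> NumPy array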
The R programming language is an open source environment designed for statistical computing and graphics applications, as well as data manipulation, analysis and visualization. Many data scientists, academic researchers and statisticians use R to retrieve, cleanse, analyze and present data, making it one of the most popular languages for data science and advanced analytics.
The open source project is supported by The R Foundation, and thousands of user-created packages with libraries of code that enhance R's functionality are available -- for example, ggplot2, a well-known package for creating graphics that's part of a collection of R-based data science tools called tidyverse. In addition, multiple vendors offer integrated development environments and commercial code libraries for R.
R is an interpreted language, like Python, and has a reputation for being relatively intuitive. It was created in the 1990s as an alternative version of S, a statistical programming language that was developed in the 1970s; R's name is both a play on S and a reference to the first letter of the names of its two creators.
SAS is an integrated software suite for statistical analysis, advanced analytics, BI and data management. Developed and sold by software vendor SAS Institute Inc., the platform enables users to integrate, cleanse, prepare and manipulate data; then they can analyze it using different statistical and data science techniques. SAS can be used for various tasks, from basic BI and data visualization to risk management, operational analytics, data mining, predictive analytics and machine learning.
The development of SAS started in 1966 at North Carolina State University; use of the technology began to grow in the early 1970s, and SAS Institute was founded in 1976 as an independent company. The software was initially built for use by statisticians -- SAS was short for Statistical Analysis System. But, over time, it was expanded to include a broad set of functionality and became one of the most widely used analytics suites in both commercial enterprises and academia.
Development and marketing are now focused primarily on SAS Viya, a cloud-based version of the platform that was launched in 2016 and redesigned to be cloud-native in 2020.
Scikit-learn is an open source machine learning library for Python that's built on the SciPy and NumPy scientific computing libraries, plus Matplotlib for plotting data. It supports both supervised and unsupervised machine learning and includes numerous algorithms and models, called estimators in scikit-learn parlance. Additionally, it provides functionality for model fitting, selection and evaluation, and data preprocessing and transformation.
Initially called scikits.learn, the library started as a Google Summer of Code project in 2007, and the first public release became available in 2010. The first part of its name is short for SciPy toolkit and is also used by other SciPy add-on packages. Scikit-learn primarily works on numeric data that's stored in NumPy arrays or SciPy sparse matrices.
The library's suite of tools also enables various other tasks, such as data set loading and the creation of workflow pipelines that combine data transformer objects and estimators. But scikit-learn has some limits due to design constraints. For example, it doesn't support deep learning, reinforcement learning or GPUs, and the library's website says its developers "only consider well-established algorithms for inclusion."
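A minimal sketch combining a data transformer and an estimator in a workflow pipeline, as described above; the synthetic data and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data stored in NumPy arrays, as scikit-learn expects.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A pipeline chaining a transformer with an estimator.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))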
SciPy is another open source Python library that supports scientific computing uses. Short for Scientific Python, it features a set of mathematical algorithms and high-level commands and classes for data manipulation and visualization. It includes more than a dozen subpackages that contain algorithms and utilities for functions such as data optimization, integration and interpolation, as well as algebraic equations, differential equations, image processing and statistics.
The SciPy library is built on top of NumPy and can operate on NumPy arrays. But SciPy delivers additional array computing tools and provides specialized data structures, including sparse matrices and k-dimensional trees, to extend beyond NumPy's capabilities.
SciPy actually predated NumPy: It was created in 2001 by combining different add-on modules built for the Numeric library that was one of NumPy's predecessors. Like NumPy, SciPy uses compiled code to optimize performance; in its case, most of the performance-critical parts of the library are written in C, C++ or Fortran.
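A minimal sketch of two of the capabilities mentioned above, sparse matrices and optimization; the function being minimised is an illustrative assumption.
import numpy as np
from scipy import optimize, sparse

# A sparse matrix built on top of a NumPy array.
m = sparse.csr_matrix(np.array([[0, 0, 1], [2, 0, 0]]))
print(m.nnz, "non-zero entries")

# Minimise a simple quadratic with scipy.optimize.
result = optimize.minimize(lambda v: (v[0] - 3.0) ** 2 + (v[1] + 1.0) ** 2, x0=[0.0, 0.0])
print(result.x)  # approximately [3, -1]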
TensorFlow is an open source machine learning platform developed by Google that's particularly popular for implementing deep learning neural networks. The platform takes inputs in the form of tensors that are akin to NumPy multidimensional arrays and then uses a graph structure to flow the data through a list of computational operations specified by developers. It also offers an eager execution programming environment that runs operations individually without graphs, which provides more flexibility for research and debugging machine learning models.
Google made TensorFlow open source in 2015, and Release 1.0.0 became available in 2017. TensorFlow uses Python as its core programming language and now incorporates the Keras high-level API for building and training models. Alternatively, a TensorFlow.js library enables model development in JavaScript, and custom operations -- or ops, for short -- can be built in C++.
The platform also includes a TensorFlow Extended module for end-to-end deployment of production machine learning pipelines, plus a TensorFlow Lite one for mobile and IoT devices. TensorFlow models can be trained and run on CPUs, GPUs and Google's special-purpose Tensor Processing Units.
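A minimal sketch of eager execution on tensors, as mentioned above; TensorFlow 2.x is assumed and the values are illustrative.
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0], [2.0]])
print(tf.matmul(a, b))             # runs immediately under eager execution

x = tf.Variable(3.0)
with tf.GradientTape() as tape:    # record operations for automatic differentiation
    y = x * x
print(tape.gradient(y, x))         # 6.0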
Weka is an open source workbench that provides a collection of machine learning algorithms for use in data mining tasks. Weka's algorithms, called classifiers, can be applied directly to data sets without any programming via a GUI or a command-line interface that offers additional functionality; they can also be implemented through a Java API.
The workbench can be used for classification, clustering, regression, and association rule mining applications and also includes a set of data preprocessing and visualization tools. In addition, Weka supports integration with R, Python, Spark and other libraries like scikit-learn. For deep learning uses, an add-on package combines it with the Eclipse Deeplearning4j library.
Weka is free software licensed under the GNU General Public License. It was developed at the University of Waikato in New Zealand starting in 1992; an initial version was rewritten in Java to create the current workbench, which was first released in 1999. Weka stands for the Waikato Environment for Knowledge Analysis and is also the name of a flightless bird native to New Zealand that the technology's developers say has "an inquisitive nature."
Commercially licensed platforms that provide integrated functionality for machine learning, AI and other data science applications are also available from numerous software vendors. The product offerings are diverse -- they include machine learning operations hubs, automated machine learning platforms and full-function analytics suites, with some combining MLOps, AutoML and analytics capabilities. Many platforms incorporate some of the data science tools listed above.
Matlab and SAS can also be counted among the data science platforms. Other prominent platform options for data science teams include the following technologies:
Some platforms are also available in free open source or community editions -- examples include Dataiku and H2O. Knime combines an open source analytics platform with a commercial Knime Hub software package that supports team-based collaboration and workflow automation, deployment and management.
Mary K. Pratt is an award-winning freelance journalist with a focus on covering enterprise IT and cybersecurity management.
Editor's note: This unranked list of data science tools is based on web research from sources such as Capterra, G2 and Gartner as well as vendor websites.
Excerpt from:
18 Data Science Tools to Consider Using in 2024 - TechTarget
Growing Significance of Data Science in the Logistics Industry – CIO Applications
With data science, logistics professionals are streamlining operations and paving the way for a more resilient and customer-focused future in the field of logistics.
FREMONT, CA: Technological advancements and rapid digital transformation have transformed industries across the board, which are now harnessing the power of data to optimize operations and gain competitive advantages. The logistics sector stands out as a prime beneficiary of the burgeoning field of Data Science. The need for data-driven decision-making has never been greater as supply chains become more complex and globalized. The mounting importance of data science in logistics is transforming the movement, storage, and management of goods.
The applications of Data Science in logistics are vast, from demand forecasting and route optimization to risk management and customer-centric solutions. Traditionally, logistics management relied heavily on experience and intuition. With the advent of Data Science, this paradigm has shifted towards a more analytical and data-centric approach. Advanced algorithms and machine learning models are now employed to process vast amounts of data generated by various facets of the supply chain, including transportation, inventory management, demand forecasting, and route optimization.
Enhanced demand forecasting
Effective logistics management requires accurate demand forecasting. By leveraging historical data, market trends, and other relevant variables, Data Science empowers logistics professionals to make precise predictions about future demand patterns. It enables them to optimize inventory levels, reduce excess stock, and meet customer expectations with greater precision.
Optimal route planning and fleet management
Efficient transportation is fundamental to the success of any logistics operation. Data Science plays a pivotal role in this aspect by optimizing route planning and fleet management. Algorithms can optimize routes based on traffic conditions, weather forecasts, and real-time updates. Predictive maintenance models help prevent unexpected breakdowns, ensuring that fleets operate at peak efficiency.
Inventory optimization
It is a delicate balance to maintain the right level of inventory. Too much can lead to excessive carrying costs, while too little can result in stockouts and missed opportunities. Data Science algorithms analyze historical sales data, supplier lead times, and seasonal trends to determine the optimal inventory levels for each product. It reduces holding costs and ensures that products are readily available when customers demand them.
Warehouse efficiency and layout design
The layout and operations of warehouses directly affect order fulfillment speed and accuracy. Data-driven insights are employed to design efficient warehouse layouts, minimizing travel distances and maximizing storage capacity. Predictive analytics can anticipate spikes in demand, enabling proactive adjustments to staffing levels and workflows.
Risk management and resilience
The logistics industry is no stranger to disruptions, whether they be natural disasters, geopolitical events, or global health crises. Data Science equips logistics professionals with the tools to assess and mitigate risks effectively. Businesses can ensure business continuity despite adversity by analyzing historical data and employing predictive modeling.
Customer-centric solutions
Data Science enables logistics providers to offer personalized and responsive services. By analyzing customer preferences, order histories, and feedback, businesses can tailor their services accordingly. It enhances customer satisfaction, fosters brand loyalty, and generates positive word-of-mouth.
Original post:
Growing Significance of Data Science in the Logistics Industry - CIO Applications
Navigating the AI Landscape of 2024: Trends, Predictions, and Possibilities – Towards Data Science
The marketing domain, traditionally commanding a lion's share of enterprise budgets, is now navigating through a transformative landscape. The catalyst? The rise of chat-based tools like ChatGPT. These innovations are potentially leading to a noticeable decline in traditional search volume, fundamentally altering how consumers engage with information.
In this evolving scenario, marketers find themselves at a crossroads. The ability to influence or monitor brand mentions in these AI-driven dialogues is still in its nascent stages. Consequently, there's a growing trend towards adapting marketing strategies for a generative AI world. This adaptation involves a strategic reliance on traditional media in the short term, leveraging its reach and impact to build and sustain brand presence.
Simultaneously, we are witnessing a significant shift in the technological landscape. The move from browser-based tools to on-device applications is gaining momentum. Leading this charge are innovations like Microsoft Co-Pilot, Google Bard on devices such as Android, and the anticipated launch of Apple's own large language model (LLM) sometime in 2024. This transition indicates a paradigm shift from web-centric interactions to a more integrated, device-based AI experience.
This shift extends beyond mere convenience; it represents a fundamental change in user interaction paradigms. As AI becomes more seamlessly integrated into devices, the distinction between online and offline interactions becomes increasingly blurred. Users are likely to interact with AI in more personal, context-aware environments, leading to a more organic and engaging user experience. For tech giants like Google, Microsoft, and Apple, already entrenched in the marketing services world, this represents an opportunity to redefine their offerings.
We can anticipate the emergence of new answer analytics platforms and operating models in marketing to support answer engine optimisation. These tools will likely focus on understanding and leveraging the nuances of AI-driven interactions, and potentially on better leveraging the training data to understand how results might be portrayed for a given brand or product.
Digital marketers will start to think more deeply about how they are indexed in these training datasets, just as they once did with search engines.
Moreover, the potential launch of ad-sponsored results or media measurement tools by platforms like OpenAI could introduce a new dimension in digital advertising. This development would not only offer new avenues for brand promotion but also challenge existing digital marketing strategies, prompting a reevaluation of metrics and ROI assessment methodologies.
As LLMs migrate into devices, moving away from traditional web interfaces, the marketing landscape is poised for significant changes. Marketers must adapt to these shifts, leveraging both traditional media and emerging AI technologies, to effectively engage with their audiences in this new digital era. This dual approach, combining the impact of traditional media with the precision of AI-driven analytics, could very well be the key to success in the rapidly evolving marketing landscape of 2024.
See the article here:
Navigating the AI Landscape of 2024: Trends, Predictions, and Possibilities - Towards Data Science
9 Effective Techniques To Boost Retrieval Augmented Generation (RAG) Systems – Towards Data Science
2023 was, by far, the most prolific year in the history of NLP. This period saw the emergence of ChatGPT alongside numerous other Large Language Models, both open-source and proprietary.
At the same time, fine-tuning LLMs became much easier, and the competition among cloud providers for GenAI offerings intensified significantly.
Interestingly, the demand for personalized and fully operational RAGs also skyrocketed across various industries, with each client eager to have their own tailored solution.
Speaking of this last point (creating fully functioning RAGs), in today's post we will discuss a paper that reviews the current state of the art of building those systems.
Without further ado, let's have a look.
If you're interested in ML content, detailed tutorials and practical tips from the industry, follow my newsletter. It's called The Tech Buffet.
I started reading this piece during my vacation
and it's a must.
It covers everything you need to know about the RAG framework and its limitations. It also lists modern techniques to boost its performance in retrieval, augmentation, and generation.
The ultimate goal behind these techniques is to make this framework ready for scalability and production use, especially for use cases and industries where answer quality matters *a lot*.
I won't discuss everything in this paper, but here are the key ideas that, in my opinion, would make your RAG more efficient.
Continue reading here:
9 Effective Techniques To Boost Retrieval Augmented Generation (RAG) Systems - Towards Data Science
Data Science in 2024: An Evolving Frontier for Analytics and Insights. – Medium
Data science has exploded in recent years as a field that extracts valuable insights from data to solve complex business problems. According to a 2021 report from Gartner, demand for data and analytics capabilities is set to increase five times by 2024. As businesses become more data-driven and the volume of data continues growing exponentially, data science will only increase in importance over the next few years. 2024 is positioned to mark a notable milestone in which modeling, algorithms, and infrastructure could propel data science into an even more vital strategic function impacting a variety of industries.
Given the surge of big data in motion from IoT sensors, social platforms, mobile devices, and other sources, AI and ML will be integral to efficient advanced analysis. Augmented analytics leverages autonomous techniques so people can shift from doing repetitive tasks to higher cognitive skill sets of critical thinking, insight discovery, and decision evaluation. As Gartner predicts, by 2025 over 50% of analytics queries will be generated using search or NLP-driven interactions rather than code-based authoring.
Visual-based exploration tools will empower business users without technical skills to access, interpret, and interact easily with data. This democratization will facilitate data literacy as analytics permeates across the organization. Metrics and dashboard customization will adapt to various personas and workflows through auto-generated content and recommendations. Predictive modeling will also progress toward more accessible and transparent self-service offerings as oversight requirements for trust and fairness increase.
While augmented analytics offloads the drudgery, it aims to elevate human judgment and creativity. The symbiotic combination of AI with people who relate context to numbers will lead to the most impactful, nuanced analysis ultimately guiding business strategy. Machines cannot wholly replace human emotion, acumen and domain expertise. As analytics becomes pervasive across enterprises, data science skills to discern high-value problems
Read more:
Data Science in 2024: An Evolving Frontier for Analytics and Insights. - Medium
A Data Science Course Project About Crop Yield and Price Prediction I’m Still Not Ashamed Of – Towards Data Science
Hello, dear reader! During these Christmas holidays, I experienced a feeling of nostalgia for my student years. That's why I decided to write a post about a student project that was done almost four years ago as part of the course Methods and Models for Multivariate Data Analysis during my Master's degree at ITMO University.
Disclaimer: I decided to write this post for two reasons:
The article mentions, in tip format, good practices that I've been able to apply during the course project.
So, at the beginning of the course, we were informed that students could form teams of 2-3 people on our own and propose a course project that we would present at the end of the course. During the learning process (about 5 months), we would make intermediate presentations to our lecturers. This way, the professors could see how the progress was (or was not) going.
After that, I immediately teamed up with my dudes: Egor and Camilo (just because we knew how to have fun together), and we started thinking about the topic
I suggested choosing
So, it was
Camilo also wanted to try making dashboards with visualisations (using PowerBI), but pretty much any task would have been suitable for that.
Tip 1: Choose a topic that you (and your colleagues) will be passionate about. It may not be the coolest project, and the topic may not be very popular, but you will enjoy spending your evenings working on it.
The course consisted of a large number of topics, each of which covered a set of methods for statistical analysis. We decided that we would try to forecast yield and crop price in as many different ways as possible and then ensemble the forecasts using some statistical method. This allowed us to try most of the methods discussed in the course in practice.
Also, the spatio-temporal data was truly multidimensional, which related pretty well to the main theme of the course.
Spoiler: we all got a score of 5 out of 5.
We started with a literature review to understand exactly how crop yield and crop price are predicted. We also wanted to understand what kind of forecast error could be considered satisfactory.
I will not cite in this post the findings resulting from this review. I will simply mention that we decided to use the following metric and threshold to evaluate the quality of the solution (for both crop yield and crop price):
Acceptable performance: Mean absolute percentage error (MAPE) for a reasonably good forecast should not exceed 10%
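For reference, MAPE is straightforward to compute; a minimal sketch (the example values are illustrative, not from the project):
import numpy as np

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(mape([100, 200, 300], [110, 190, 330]))  # ~8.3%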
Tip 2: Start your project (whether at work or during your studies) with a review of contemporary solutions. Maybe the problem you are looking at has already been solved.
Tip 3: Before starting development, determine what metric you will use to evaluate the solution. Remember, you can't improve what you can't measure.
Going back to the research, we identified the following data sources (links are up to date as of 28 December 2023):
Why these sources? We have assumed that the price of a crop will depend on the amount of product produced. And in agriculture, the quantity produced depends on weather conditions.
The model was implemented for:
So, we started with an assumption: wheat, rice, maize, and barley yields depend on weather conditions in the first half of the year (until 30 June) (Figure 2).
The source archives obtained from the European Space Agency website contain netCDF files. The files have daily fields for the following parameters:
Based on the initial fields, the following parameters for the first half of each year were calculated:
Thus we obtained matrices for the whole territory of Europe with calculated features for the future model(s). The reader may notice that I calculate a parameter called the sum of active temperatures above 10 degrees Celsius. This is a popular parameter in ecology and botany that helps to determine the temperature optimums for different species of organisms (mainly plants; see, for example, The sum of active temperatures as a method of determining the optimum harvest date of Sampion and Ligol apple cultivars).
Tip 4: If you have expertise in a domain (not related to Data Science), make sure you use it in the project: show that you are not only doing fit-predict but also adapting and improving domain-specific approaches.
The next step is aggregation of information by country: values from the meteorological parameter matrices were extracted for each country separately (Figure 4).
I would note that this strategy made sense (Figure 5): For example, the picture shows that for Spain, wheat yields are almost unaffected by the sum of active temperatures. However, for the Czech Republic, a hotter first half of the year is more likely to result in lower yields. It is therefore a good idea to model yields separately for each country.
Not all of a country's territory is suitable for agriculture. Therefore, it was necessary to aggregate information only from certain pixels. In order to account for the location of agricultural land, the following matrix was prepared (Figure 6).
So, we've got the data ready. However, agriculture is a very complex industry that has improved markedly year by year, decade by decade. It may make sense to limit the training sample for the model. For this purpose, we used the cumulative sum method (Figure 7):
Cumulative sum method: each number in the sample is added sequentially to the numbers that follow it. That is, if the sample includes only three years (1950, 1951, and 1952), the value for 1950 will be plotted on the Y-axis for 1950, while 1951 will show the sum of the 1950 and 1951 values, etc.
- If the shape of the line is close to a straight line and there are no fractures, the sample is homogeneous
- If the shape of the line has fractures, the sample is divided into 2 parts based on the fracture
If a fracture was detected, we compared the two samples to check whether they belong to the same general population (Kolmogorov-Smirnov statistic). If the samples were statistically significantly different, we used the second part to train the model for prediction. If not, we used the entire sample.
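A minimal sketch of that homogeneity check (the yield series and the fracture position are illustrative assumptions, not the project's data):
import numpy as np
from scipy.stats import ks_2samp

yields = np.array([2.1, 2.2, 2.0, 2.3, 3.1, 3.3, 3.4, 3.6])  # hypothetical yearly yields
cumulative = np.cumsum(yields)   # plotted against years, a fracture shows up as a change of slope

# Suppose a fracture is suspected after the 4th year: compare the two sub-samples.
stat, p_value = ks_2samp(yields[:4], yields[4:])
use_recent_part_only = p_value < 0.05   # statistically different -> train on the recent part
print(stat, p_value, use_recent_part_only)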
Tip 5: Don't be afraid to combine approaches to statistical analysis (it is a course project!). For example, in the lecture we were not told about the cumulative sums method; the topic was about comparing distributions. However, I have previously used this approach to compare trends in ice conditions during the processing of ice maps. It seemed to me that it could be useful here as well.
I should note here that we assumed the process to be ergodic, which is why we decided to compare the samples in this way.
So, after the preparation, we are ready to start building statistical models; let's take a look at the most interesting part!
The following features were included in the model:
Target variables: Yield of wheat, rice, maize, and barley
Validation years: 2008-2018 for each country
Let's move on to the visualisations to make it a little clearer.
And here is Figure 9, showing the residuals (residual = observed value - estimated (predicted) value) from the linear model for France and Italy:
It can be seen from the graphs that the metric is satisfactory, but the error distribution is biased away from zero, which means that the model has a systematic error. We tried to correct this in the new models below.
Validation sample MAPE metric value: 10.42%
Tip 6: Start with the simplest models (e.g. linear regression). This will give you a baseline against which you can compare improved versions of the model. The simpler the model, the better it is, as long as it shows a satisfactory metric.
We've turned the material from this lecture into a model: distribution analysis. The assumption was simple: we analysed the distributions of climatic parameters for each past year and for the current year, found an analogue year for the current one, and predicted the yield value to be exactly the same as the known value from that past year (Figure 10).
Idea: Yields for years with similar weather conditions will be similar
The approach: pairwise comparison of temperature, precipitation, and pressure distributions. The prediction is the yield for the year that is most similar to the one considered.
Distributions used:
For comparison of distributions we used the Kruskal-Wallis test. To adjust the p-value, a multiple testing correction is introduced: the Bonferroni correction.
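A minimal sketch of such a pairwise comparison with a Bonferroni correction (the weather samples below are randomly generated stand-ins, not the project's data):
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
current_year_temp = rng.normal(15, 3, 180)                      # hypothetical daily temperatures
candidate_years_temp = [rng.normal(14 + k, 3, 180) for k in range(3)]

alpha_corrected = 0.05 / len(candidate_years_temp)              # Bonferroni correction
for k, sample in enumerate(candidate_years_temp):
    stat, p = kruskal(current_year_temp, sample)
    print(k, round(p, 4), "similar" if p > alpha_corrected else "different")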
Validation sample MAPE metric value: 13.80%
Tip 7: If you are doing multiple statistical testing, don't forget to include a correction (for example, the Bonferroni correction).
One of the lectures focused on Bayesian networks. Therefore, we decided to adapt the approach for yield prediction. We considered that each year is described by a set of groups of variables A, B, C, etc., where A is a set of categories describing crop yields, B is, for example, the sum-of-active-temperatures conditions, and so on. A, for example, could take only three values: high crop yield, medium crop yield, low crop yield. The same goes for B, C, and the others. Thus, if we categorise the conditions and the target variable, we obtain the following description of each year:
The algorithm was designed to predict a yield category based on a combination of three other categories:
How can we define these categories? By using a clustering algorithm! For example, the following 3 clusters were identified for wheat yields.
The final forecast of this model is the average yield of the predicted cluster.
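A minimal sketch of categorising yields via clustering in that spirit (the yield values and number of clusters are illustrative assumptions):
import numpy as np
from sklearn.cluster import KMeans

yields = np.array([1.8, 2.0, 2.1, 2.9, 3.0, 3.1, 4.0, 4.2]).reshape(-1, 1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(yields)

labels = kmeans.labels_                                    # low / medium / high yield categories
cluster_means = [yields[labels == c].mean() for c in range(3)]
print(labels, cluster_means)   # the forecast would be the mean yield of the predicted cluster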
Validation sample MAPE metric value: 14.55%
Tip 8: Do experiment! Bayesian networks with clustering for time series forecasting? Sure! Pairwise analysis of distributions? Why not? Sometimes the boldest approaches lead to significant improvements.
Of course, we can forecast the target variable as a time series. Our task here was to understand how classical forecasting methods work in theory and practice.
Putting this method into practice proved to be the easiest. In Python there are several libraries that allow you to customise and apply the ARIMA model, such as pmdarima.
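A minimal pmdarima sketch (the series and forecast horizon are illustrative assumptions):
import numpy as np
import pmdarima as pm

y = np.array([2.1, 2.3, 2.2, 2.5, 2.6, 2.8, 2.7, 3.0, 3.1, 3.2])  # hypothetical yearly yields
model = pm.auto_arima(y, seasonal=False, suppress_warnings=True)   # selects the ARIMA order
forecast = model.predict(n_periods=3)                              # forecast the next three years
print(forecast)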
Validation sample MAPE metric value: 10.41%
Tip 9: Don't forget the comparison with classical approaches. An abstract metric will not tell your colleague much about how good your model is, but a comparison with well-known standards will show the true level of performance.
After all the models were built, we explored exactly how each model errs (remember the residual plots for the linear regression model; see Figure 9):
None of the presented algorithms managed to beat the 10% threshold (according to MAPE).
The Kalman filter was used to improve the quality of the forecast (to ensemble the individual forecasts). Satisfactory results were achieved for some countries (Figure 15).
Validation sample MAPE metric value: 9.89%
Tip 10: If I were asked to integrate the developed model into a production service, I would integrate either ARIMA or linear regression, even though the ensemble metric is better. However, metrics in business problems are sometimes not the key. A standalone model is sometimes better than an ensemble because it is simpler and more reliable (even if the error metric is slightly higher).
And the final part: a model (lasso regression) which used the predicted yield values and futures features to estimate possible price values (Figure 16):
MAPE on validation sample: 6.61%
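As a flavour of that final step, a minimal lasso sketch (the features and data are synthetic stand-ins, not the project's inputs):
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = np.column_stack([
    rng.normal(3.0, 0.4, 50),   # predicted yield
    rng.normal(180, 20, 50),    # a hypothetical futures-based feature
])
price = 250 - 30 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, 50)

model = Lasso(alpha=0.1).fit(X, price)
print(model.coef_, model.intercept_)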
So that's the end of the story. Some tips were given above, and in this last paragraph I want to summarise and say why I am satisfied with that project. Here are the three main items:
Well, we also got great marks on the exam XD
I hope your projects at university and elsewhere will be as exciting for you. Happy New Year!
Sincerely yours, Mikhail Sarafanov
Continue reading here:
Deriving a Score to Show Relative Socio-Economic Advantage and Disadvantage of a Geographic Area – Towards Data Science
There exist publicly accessible data which describe the socio-economic characteristics of a geographic location. In Australia where I reside, the Government through the Australian Bureau of Statistics (ABS) collects and publishes individual and household data on a regular basis in respect of income, occupation, education, employment and housing at an area level. Some examples of the published data points include:
Whilst these data points appear to focus heavily on individual people, they reflect people's access to material and social resources, and their ability to participate in society in a particular geographic area, ultimately informing the socio-economic advantage and disadvantage of this area.
Given these data points, is there a way to derive a score which ranks geographic areas from the most to the least advantaged?
The goal of deriving a score suggests formulating this as a regression problem, where each data point or feature is used to predict a target variable, in this scenario, a numerical score. This requires the target variable to be available in some instances for training the predictive model.
However, as we don't have a target variable to start with, we may need to approach this problem in another way. For instance, under the assumption that each geographic area is different from a socio-economic standpoint, can we aim to understand which data points help explain the most variation, thereby deriving a score based on a numerical combination of these data points?
We can do exactly that using a technique called Principal Component Analysis (PCA), and this article demonstrates how!
ABS publishes data points indicating the socio-economic characteristics of a geographic area in the Data Download section of this webpage, under the Standardised Variable Proportions data cube [1]. These data points are published at the Statistical Area 1 (SA1) level, which is a digital boundary segregating Australia into areas with populations of approximately 200-800 people. This is a much more granular digital boundary compared to the Postcode (Zipcode) or State digital boundaries.
For the purpose of demonstration in this article, I'll be deriving a socio-economic score based on 14 out of the 44 published data points provided in Table 1 of the data source above (I'll explain why I selected this subset later on). These are:
In this section, I'll be stepping through the Python code for deriving a socio-economic score for an SA1 region in Australia using PCA.
I'll start by loading in the required Python packages and the data.
### For dataframe operations
import numpy as np
import pandas as pd
### For PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
### For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
### For Validation
from scipy.stats import pearsonr
file1 = 'data/standardised_variables_seifa_2021.xlsx'
### Reading from Table 1, from row 5 onwards, for column A to AT
data1 = pd.read_excel(file1, sheet_name = 'Table 1', header = 5, usecols = 'A:AT')
data1_dropna = data1.dropna()
An important cleaning step before performing PCA is to standardise each of the 14 data points (features) to a mean of 0 and standard deviation of 1. This is primarily to ensure the loadings assigned to each feature by PCA (think of them as indicators of how important a feature is) are comparable across features. Otherwise, more emphasis, or higher loading, may be given to a feature which is actually not significant or vice versa.
Note that the ABS data source quoted above already has the features standardised. That said, for an unstandardised data source:
### Take all but the first column which is merely a location indicator
data_final = data1_dropna.iloc[:, 1:]
### Perform standardisation of data
sc = StandardScaler()
sc.fit(data_final)
### Standardised data
data_final = sc.transform(data_final)
With the standardised data, PCA can be performed in just a few lines of code:
pca = PCA()
pca.fit_transform(data_final)
PCA aims to represent the underlying data by Principal Components (PC). The number of PCs provided in a PCA is equal to the number of standardised features in the data. In this instance, 14 PCs are returned.
Each PC is a linear combination of all the standardised features, only differentiated by its respective loadings of the standardised feature. For example, the image below shows the loadings assigned to the first and second PCs (PC1 and PC2) by feature.
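The loadings themselves can be inspected directly from the fitted PCA object; here is a minimal sketch of how a dataframe like the article's df_plot might be assembled, reusing the pca and data1_dropna objects from the code above (the exact construction and naming used in the article are assumptions).
### Rows are PCs, columns are the standardised features
### (the first column of data1_dropna is the SA1 identifier, so it is skipped)
loadings = pd.DataFrame(
    pca.components_,
    columns = data1_dropna.columns[1:],
    index = ['PC' + str(i + 1) for i in range(pca.n_components_)]
)
df_plot = loadings.iloc[:2]   # PC1 and PC2 loadings by feature, as in Image 1
print(df_plot)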
With 14 PCs, the code below provides a visualization of how much variation each PC explains:
exp_var_pca = pca.explained_variance_ratio_
plt.bar(range(1, len(exp_var_pca) + 1), exp_var_pca, alpha = 0.7, label = '% of Variation Explained', color = 'darkseagreen')
plt.ylabel('Explained Variation')
plt.xlabel('Principal Component')
plt.legend(loc = 'best')
plt.show()
As illustrated in the output visualization below, Principal Component 1 (PC1) accounts for the largest proportion of variance in the original dataset, with each following PC explaining less of the variance. To be specific, PC1 explains circa 35% of the variation within the data.
For the purpose of demonstration in this article, PC1 is chosen as the only PC for deriving the socio-economic score, for the following reasons:
### Using df_plot dataframe per Image 1
sns.heatmap(df_plot, annot = False, fmt = ".1f", cmap = 'summer')
plt.show()
To obtain a score for each SA1, we simply multiply the standardised portion of each feature by its PC1 loading. This can be achieved by:
### Perform sum product of standardised feature and PC1 loading
pca.fit_transform(data_final)
### Reverse the sign of the sum product above to make output more interpretable
pca_data_transformed = -1.0 * pca.fit_transform(data_final)
### Convert to Pandas dataframe, and join raw score with SA1 column
pca1 = pd.DataFrame(pca_data_transformed[:, 0], columns = ['Score_Raw'])
score_SA1 = pd.concat([data1_dropna['SA1_2021'].reset_index(drop = True), pca1], axis = 1)
### Inspect the raw score
score_SA1.head()
The higher the score, the more advantaged an SA1 is in terms of its access to socio-economic resources.
How do we know the score we derived above was even remotely correct?
For context, the ABS actually published a socio-economic score called the Index of Economic Resources (IER), defined on the ABS website as:
The Index of Economic Resources (IER) focuses on the financial aspects of relative socio-economic advantage and disadvantage, by summarising variables related to income and housing. IER excludes education and occupation variables as they are not direct measures of economic resources. It also excludes assets such as savings or equities which, although relevant, cannot be included as they are not collected in the Census.
Without disclosing the detailed steps, the ABS stated in their Technical Paper that the IER was derived using the same features (14) and methodology (PCA, PC1 only) as what we performed above. That is, if we did derive the correct scores, they should be comparable against the IER scores published here (Statistical Area Level 1, Indexes, SEIFA 2021.xlsx, Table 4).
As the published score is standardised to a mean of 1,000 and a standard deviation of 100, we start the validation by standardising the raw score in the same way:
score_SA1['IER_recreated'] = (score_SA1['Score_Raw']/score_SA1['Score_Raw'].std())*100 + 1000
For comparison, we read in the published IER scores by SA1:
file2 = 'data/Statistical Area Level 1, Indexes, SEIFA 2021.xlsx'
data2 = pd.read_excel(file2, sheet_name = 'Table 4', header = 5,usecols = 'A:C')
data2.rename(columns = {'2021 Statistical Area Level 1 (SA1)': 'SA1_2021', 'Score': 'IER_2021'}, inplace = True)
col_select = ['SA1_2021', 'IER_2021']
data2 = data2[col_select]
ABS_IER_dropna = data2.dropna().reset_index(drop = True)
Validation 1: PC1 Loadings
As shown in the image below, comparing the PC1 loadings derived above against the PC1 loadings published by the ABS suggests that they differ by a constant of -45%. As this is merely a scaling difference, it doesn't impact the derived scores, which are standardised (to a mean of 1,000 and standard deviation of 100).
(You should be able to verify the Derived (A) column with the PC1 loadings in Image 1).
Validation 2: Distribution of Scores
The code below creates a histogram for both scores, whose shapes look to be almost identical.
score_SA1.hist(column = 'IER_recreated', bins = 100, color = 'darkseagreen')
plt.title('Distribution of recreated IER scores')
ABS_IER_dropna.hist(column = 'IER_2021', bins = 100, color = 'lightskyblue')
plt.title('Distribution of ABS IER scores')
plt.show()
Validation 3: IER score by SA1
As the ultimate validation, let's compare the IER scores by SA1:
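The join of the two sets of scores is not shown in this excerpt; a minimal sketch, assuming an inner merge of the recreated and published scores on the SA1 identifier:
### Join recreated and published scores by SA1 (an assumed reconstruction of the omitted step)
IER_join = pd.merge(score_SA1, ABS_IER_dropna, how = 'inner', on = 'SA1_2021')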
## Plot scores on x-y axis.
## If scores are identical, it should show a straight line.
plt.scatter('IER_recreated', 'IER_2021', data = IER_join, color = 'darkseagreen')
plt.title('Comparison of recreated and ABS IER scores')
plt.xlabel('Recreated IER score')
plt.ylabel('ABS IER score')
plt.show()
A diagonal straight line as shown in the output image below supports that the two scores are largely identical.
To add to this, the code below shows the two scores have a correlation close to 1:
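The correlation check itself is not shown in this excerpt; a minimal sketch using the pearsonr import and the (assumed) IER_join dataframe from above:
### Pearson correlation between the recreated and published scores (assumed reconstruction)
corr, _ = pearsonr(IER_join['IER_recreated'], IER_join['IER_2021'])
print(corr)   # expected to be very close to 1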
The demonstration in this article effectively replicates how the ABS calibrates the IER, one of the four socio-economic indexes it publishes, which can be used to rank the socio-economic status of a geographic area.
Taking a step back, what we've achieved in essence is a reduction in the dimensionality of the data from 14 to 1, losing some of the information conveyed by the data.
Dimensionality reduction techniques such as PCA are also commonly used to help reduce high-dimensional spaces, such as text embeddings, to 2-3 (visualizable) Principal Components.
Visit link: