Category Archives: Data Science
Unlock real-time insights with AI-powered analytics in Microsoft Fabric – Microsoft
The data and analytics landscape is changing faster than ever. From the emergence of generative AI to the proliferation of citizen analysts to the increasing importance of real-time, autonomous action, keeping up with the latest trends can feel overwhelming. Every trend requires new services that customers must manually stitch into their data estate, driving up both cost and complexity.
With Microsoft Fabric, we are simplifying and future-proofing your data estate with an ever-evolving, AI-powered data analytics platform. Fabric will keep up with the trends for you and seamlessly integrate each new capability so you can spend less time integrating and managing your data estate and more time unlocking value from your data.
Aurizon, Australia's largest rail freight operator, turned to Fabric to modernize their data estate and analytics system.
"With Microsoft Fabric, we've answered many of our questions about how to navigate future growth, remove legacy systems, and streamline and simplify our architecture. A trusted data platform sets us up to undertake complex predictive analytics and optimizations that will give greater surety for our business and drive commercial benefits for Aurizon and our customers in the very near future."
Aurizon is just one among thousands of customers who have already used Fabric to revolutionize how they connect to and analyze their data. In fact, a 2024 commissioned Total Economic Impact (TEI) study conducted by Forrester Consulting found that Microsoft Fabric customers saw a three-year 379% return on investment (ROI) with a payback period of less than six months. We are thrilled to share a huge range of new capabilities coming to Fabric. These innovations will help you more effectively uncover insights and keep you at the forefront of the trends in data and analytics. Check out a quick overview of the biggest changes coming to Fabric.
Prepare your data for AI innovation with Microsoft Fabric, now generally available
Fabric is a complete data platform, giving your data teams the ability to unify, transform, analyze, and unlock value from data, all from a single, integrated software as a service (SaaS) experience. We are excited to announce additions to the Fabric workloads that will make Fabric's capabilities even more robust and customizable to meet the unique needs of each organization. These enhancements include:
When we introduced Fabric, it launched with seven core workloads which included Synapse Real-time Analytics for data streaming analysis and Data Activator for monitoring and triggering actions in real-time. We are unveiling an enhanced workload called Real-Time Intelligence that combines these workloads and brings an array of additional new features, in preview, to help organizations make better decisions with up-to-the-minute insights. From ingestion to transformation, querying, and taking immediate action, Real-Time Intelligence is an end-to-end experience that enables seamless handling of real-time data without the need to land it first. With Real-Time Intelligence, you can ingest streaming data with high granularity, dynamically transform streaming data, query data in real-time for instant insights, and trigger actions like alerting a production manager when equipment is overheating or rerunning jobs when data pipelines fail. And with both simple, low-code or no-code, and powerful, code-rich interfaces, Real-Time Intelligence empowers every user to work with real-time data.
Behind this powerful workload is the Real-time hub, a single place to discover, manage, and use event streaming data from Fabric and other data sources from Microsoft, third-party cloud providers, and other external data sources. Just like the OneLake data hub makes it easy to discover, manage, and use the data at rest, the Real-time hub can help you do the same for data in motion. All events that flow through the Real-time hub can be easily transformed and routed to any Fabric data store and users can create new streams that can be discovered and consumed. From the Real-time hub, users can gain insights through the data profile, configure the right level of endorsement, set alerts on changing conditions and more, all without leaving the hub. While the existing Real-Time Analytics capabilities are still generally available, the Real-time hub and the other new capabilities coming to the Real-Time Intelligence workload are currently in preview. Watch this demo video to check out the redesigned Real-Time Intelligence experience:
Elcome, one of the world's largest marine electronics companies, built a new service on Fabric called Welcome that helps maritime crews stay connected to their families and friends.
"Microsoft Fabric Real-Time Intelligence has been the essential building block that's enabled us to monitor, manage, and enhance the services we provide. With the help of the Real-time hub for centrally managing data in motion from our diverse sources and Data Activator for event-based triggers, Fabric's end-to-end cloud solution has empowered us to easily understand and act on high-volume, high-granularity events in real time with fewer resources."
Real-time insights are becoming increasingly critical across industries, from route optimization in transportation and logistics to grid monitoring in energy and utilities, predictive maintenance in manufacturing, and inventory management in retail. And since Real-Time Intelligence comes fully optimized and integrated in a SaaS platform, adoption is seamless. Strathan Campbell, Channel Environment Technology Lead at One NZ, the largest mobile carrier in New Zealand, said they went from a concept to a delivered product in just two weeks. To learn more about the Real-Time Intelligence workload, watch the "Ingest, analyze and act in real time with Microsoft Fabric" Microsoft Build session or read the Real-Time Intelligence blog.
Fabric was built from the ground up to be extensible, customizable, and open. Now, we are making it even easier for software developers and customers to design, build, and interoperate applications within Fabric with the new Fabric Workload Development Kit, currently in preview. Applications built with this kit will appear as a native workload within Fabric, providing a consistent experience for users directly in their Fabric environment without any manual effort. Software developers can publish and monetize their custom workloads through Azure Marketplace. And, coming soon, we are creating a workload hub experience in Fabric where users can discover, add, and manage these workloads without ever leaving the Fabric environment. We already have industry-leading partners building on Fabric, including SAS, Esri, Informatica, Teradata, and Neo4j.
You can also learn more about the Workload Development Kit by watching the "Extend and enhance your analytics applications with Microsoft Fabric" Microsoft Build session.
We are also excited to announce two new features, both in preview, created with developers in mind: API for GraphQL and user data functions in Fabric. API for GraphQL is a flexible and powerful API that allows data professionals to access data from multiple sources in Fabric with a single query API. With API for GraphQL, you can streamline requests to reduce network overhead and accelerate response rates. User data functions are user-defined functions built for Fabric experiences across all data services, such as notebooks, pipelines, or event streams. These features enable developers to build experiences and applications more easily using Fabric data sources like lakehouses, data warehouses, and mirrored databases, with native code ability, custom logic, and seamless integration. You can watch these features in action in the "Introducing API for GraphQL and User Data Functions in Microsoft Fabric" Microsoft Build session.
You can also learn more about the Workload Development Kit, the API for GraphQL, user data functions, and more by reading the Integrating ISV apps with Microsoft Fabric blog.
We are also announcing the preview of Data workflows in Fabric as part of the Data Factory experience. Data workflows allow customers to define Directed Acyclic Graph (DAG) files for complex data workflow orchestration in Fabric. Data workflows are powered by the Apache Airflow runtime and designed to help you author, schedule, and monitor workflows or data pipelines using Python. Learn more by reading the data workflows blog.
The typical data estate has grown organically over time to span multiple clouds, accounts, databases, domains, and engines with a multitude of vendors and specialized services. OneLake, Fabric's unified, multi-cloud data lake built to span an entire organization, can connect to data from across your data estate and reduce data duplication and sprawl.
We are excited to announce the expansion of OneLake shortcuts to connect to data from on-premises and network-restricted data sources beyond just Azure Data Lake Storage Gen2, now in preview. With an on-premises data gateway, you can now create shortcuts to Google Cloud Storage, Amazon S3, and S3-compatible storage buckets that are either on-premises or otherwise network-restricted. To learn more about these announcements, watch the Microsoft Build session "Unify your data with OneLake and Microsoft Fabric."
Insights drive impact only when they reach those who can use them to inform actions and decisions. Professional and citizen analysts bridge the gap between data and business results, and with Fabric, they have the tools to quickly manage, analyze, visualize, and uncover insights that can be shared with the entire organization. We are excited to help analysts work even faster and more effectively by releasing the model explorer and the DAX query view in Microsoft Power BI Desktop into general availability.
The model explorer in Microsoft Power BI provides a rich view of all the semantic model objects in the data pane, helping you find items in your data fast. You can also use the model explorer to create calculation groups and reduce the number of measures by reusing calculation logic and simplifying semantic model consumption.
The DAX query view in Power BI Desktop lets users discover, analyze, and see the data in their semantic model using the DAX query language. Users working with a model can validate data and measures without having to build a visual or use an additional tool, similar to the Explore feature. Changes made to measures can be seamlessly updated directly back to the semantic model.
To learn more about these announcements and others coming to Power BI, check out the Power BI blog.
When ChatGPT was launched, it reached over 100 million users in just over two months, the steepest adoption curve in the history of technology.1 It's been a year and a half since that launch, and organizations are still trying to translate the benefit of generative AI from novelty into actual business results. By infusing generative AI into every layer of Fabric, we can empower your data professionals to employ its benefits, in the right context and in the right scenario, to get more done, faster.
Copilot in Fabric was designed to help users unlock the full potential of their data by assisting data professionals to be more productive and business users to explore their data more easily. With Copilot in Fabric, you can use conversational language to create dataflows, generate code and entire functions, build machine learning models, or visualize results. We are excited to share that Copilot in Fabric is now generally available, starting with the Power BI experience. This includes the ability to create stunning reports and summarize your insights into narrative summaries in seconds. Copilot in Fabric is also now enabled by default for all eligible tenants, including the Copilot in Fabric experiences for Data Factory, Data Engineering, Data Science, Data Warehouse, and Real-Time Intelligence, which are all still in preview. The general availability of Copilot in Fabric for the Power BI experience will be rolling out over the coming weeks to all customers with Power BI Premium capacity (P1 or higher) or Fabric capacity (F64 or higher).
We are also thrilled to announce a new Copilot in Fabric experience for Real-Time Intelligence, currently in preview, that enables users to explore real-time data with ease. Starting with a Kusto Query Language (KQL) Queryset connected to a KQL Database in an Eventhouse or a standalone Azure Data Explorer database, you can type your question in conversational language and Copilot will automatically translate it to a KQL query you can execute. This experience is especially powerful for users who are less familiar with writing KQL queries but still want to get the most from their time-series data stored in Eventhouse.
We are also thrilled to release a new AI capability in preview called AI skills, an innovative experience designed to provide any user with a conversational Q&A experience about their data. AI skills allow you to simply select the data source in Fabric you want to explore and immediately start asking questions about your data, even without any configuration. When answering questions, the generative AI experience will show the query it generated to find the answer, and you can enhance the Q&A experience by adding more tables, setting additional context, and configuring settings. AI skills can empower everyone to explore data, build and configure AI experiences, and get the answers and insights they need.
AI skills will honor existing security permissions and can be configured to respect the unique language and nuances of your organization, ensuring that responses are not just data-driven but steeped in the context of your business operations. And, coming soon, it can also enrich the creation of new copilots in Microsoft Copilot Studio and be interacted with from Copilot for Microsoft 365. It's about making your data not just accessible but approachable, inviting users to explore insights through natural dialogue and shortening the time to insight.
With the launch of Fabric, we've committed to open data formats, standards, and interoperability with our partners to give our customers the flexibility to do what makes sense for their business. We are taking this commitment a step further by expanding our existing partnership with Snowflake to deepen interoperability between Snowflake and Fabric's OneLake. We are excited to announce future support for Apache Iceberg in Fabric OneLake and bi-directional data access between Snowflake and Fabric. This integration will enable users to analyze their Fabric and Snowflake data written in Iceberg format in any engine within either platform, and access data across apps like Microsoft 365, Microsoft Power Platform, and Microsoft Azure AI Studio.
With the upcoming availability of shortcuts for Iceberg in OneLake, Fabric users will be able to access all data sources in Iceberg format, including the Iceberg sources from Snowflake, and translate metadata between Iceberg and Delta formats. This means you can work with a single copy of your data across Snowflake and Fabric. Since all the OneLake data can be accessed in Snowflake as well as in Fabric, this integration will enable you to spend less time stitching together applications and your data estate, and more time uncovering insights. To learn more about this announcement, read the Fabric and Snowflake partnership blog.
We are also excited to announce we are expanding our existing relationship with Adobe. Adobe Experience Platform (AEP) and Adobe Campaign will have the ability to federate enterprise data from Fabric. Our joint customers will soon have the capability to connect to Fabric and use the Fabric Data Warehouse for query federation to create and enrich audiences for engagement, without having to transfer or extract the data from Fabric.
We are excited to announce that we are expanding the integration between Fabric and Azure Databricks, allowing you to have a truly unified experience across both products and pick the right tools for any scenario.
Coming soon, you will be able to access Azure Databricks Unity Catalog tables directly in Fabric, making it even easier to unify Azure Databricks with Fabric. From the Fabric portal, you can create and configure a new Azure Databricks Unity Catalog item in Fabric with just a few clicks. You can add a full catalog, a schema, or even individual tables to link, and the management of this Azure Databricks item in OneLake (a shortcut connected to Unity Catalog) is automatically taken care of for you.
This data acts like any other data in OneLake: you can write SQL queries or use it with any other workloads in Fabric, including Power BI through Direct Lake mode. When the data is modified or tables are added, removed, or renamed in Azure Databricks, the data in Fabric will always remain in sync. This new integration makes it simple to unify Azure Databricks data in Fabric and seamlessly use it across every Fabric workload.
Also coming soon, Fabric users will be able to access Fabric data items like lakehouses as a catalog in Azure Databricks. While the data remains in OneLake, you can access and view data lineage and other metadata in Azure Databricks and leverage the full power of Unity Catalog. This includes extending Unity Catalog's unified governance over data and AI into Azure Databricks Mosaic AI. In total, you will be able to combine this data with other native and federated data in Azure Databricks, perform analysis assisted by generative AI, and publish the aggregated data back to Power BI, making this integration complete across the entire data and AI lifecycle.
Join us at Microsoft Build from May 21 to 23, 2024 to see all of these announcements in action across the following sessions:
You can also try out these new capabilities and everything Fabric has to offer yourself by signing up for a free 60-day trial (no credit card information required). To start your free trial, sign up for a free account (Power BI customers can use their existing account), and once signed in, select "start trial" within the account manager tool in the Fabric app. Existing Power BI Premium customers can already access Fabric by simply turning it on in their Fabric admin portal. Learn more on the Fabric get started page.
We are excited to announce a European Microsoft Fabric Community Conference that will be held in Stockholm, Sweden from September 23 to 26, 2024. You can see firsthand how Fabric and the rest of the data and AI products at Microsoft can help your organization prepare for the era of AI. You will hear from leading Microsoft and community experts from around the world and get hands-on experience with the latest features from Fabric, Power BI, Azure Databases, Azure AI, Microsoft Purview, and more. You will also have the opportunity to learn from top data experts and AI leaders while having the chance to interact with your peers and share your story. We hope you will join us and see how cutting-edge technologies from Microsoft can enable your business success with the power of Fabric.
If you want to learn more about Microsoft Fabric:
Experience the next generation in analytics
1. "ChatGPT sets record for fastest-growing user base - analyst note," Reuters.
Arun Ulagaratchagan
Corporate Vice President, Azure Data, Microsoft
Arun leads product management, engineering, and cloud operations for Azure Data, which includes databases, data integration, big data analytics, messaging, and business intelligence. The products in his teams' portfolio include Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure MySQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, Power BI, and Microsoft Fabric.
TECH TUESDAY: Assessing the Present and Future of AI in Markets – Traders Magazine
TECH TUESDAY is a weekly content series covering all aspects of capital markets technology. TECH TUESDAY is produced in collaboration with Nasdaq.
Artificial intelligence (AI), specifically generative AI, has perhaps been the hottest emerging technology topic in markets of late. Traders Magazine caught up with Mike O'Rourke, Head of AI and Emerging Technology at Nasdaq, to learn more about the current AI landscape and how it will evolve.
Tell us about your background and your current role at Nasdaq.
I've been with Nasdaq for 25 years, the last 10 of which have been primarily focused on emerging technologies: AI, data science, and our move to the cloud.
Back in 2016, I ran our data business technology; we were doing a lot of data science, but there was no formal data science program at Nasdaq. So, we built out a whole career track as well as a center of excellence for AI because, at the time, Brad Peterson, Chief Technology and Information Officer at Nasdaq, and I anticipated that this was going to be very important. We wanted to start building up talent, a skill set, and prowess in AI.
When did Nasdaq start using AI, and what were the early use cases?
The early projects at that time focused on using machine learning and AI language models to understand data and make new alternative data sets. How can we use AI to process unstructured data and convert it into structured data and knowledge so people can make better investments and trading decisions? We also used machine learning models to better understand trading activity and behavior. Why am I not getting the fill rate that I want? Why did this order not execute? How are my trading patterns compared to other firms? Things like that.
How has Nasdaq's cloud journey and partnership with Amazon Web Services (AWS) facilitated AI adoption?
AWS has been a wonderful partner. The cloud is really where innovations happen first, and Nasdaq took the stance that we needed to be early adopters by moving our data and systems there. It was about being agile and able to scale more easily.
Our investment in the cloud is paying off: the new emerging AI technologies are there, and because our data is there, too, we can readily implement these new models on top of our data. We've invested a lot in having really good data systems, and to have good AI, you need sound data, which we have in the cloud.
AI is one of our strategic pillars across the company; we want to incorporate it into all our products and services, and the cloud is a significant enabler of this.
What current AI use cases are you most excited about?
I'd be remiss if I didn't talk about Dynamic Midpoint Extended Life Order (M-ELO), which is part of our dynamic markets program. We believe we can provide better service to our clients by having order types that can dynamically adapt to market conditions. It's still early days for Dynamic M-ELO, but it's been quite successful: we're seeing increased order flow and higher hit rates, and no degradation in markouts. So, clients are clearly benefiting from the service.
Another part of dynamic markets is our strike optimization program, where we use machine learning models to figure out which strikes we should have listed in the options market. About 1.3 million strikes can be listed in the options market, so manually determining what strikes should be listed is difficult. We think this is a perfect use case where machines can do it better than humans.
We have several other items in the research phase, such as order types and other solutions within the dynamic markets program. We believe these two first launches are just the beginning of how AI will transform markets.
Generative AI has been a very hot topic. Do any AI use cases at Nasdaq utilize generative AI?
Broadly speaking, if there's an investigative component anywhere within a solution set, generative AI can be very valuable.
As for specific applications, our Verafin and market surveillance solutions are rolling out automated investigators that use generative AI to make investigating illicit activities more effortless. We recently launched a feature in BoardVantage that allows us to summarize documents, which can make reviewing lengthy board documents easier.
Beyond those examples, there are a host of new product enhancements that use generative AI. Really, anywhere there's a user interface where people are trying to answer questions, you're probably going to see some sort of generative chatbot interface in the near future.
What will the AI adoption we've been talking about mean for Nasdaq and the markets in the future?
Information can be acted upon much more quickly because generative systems can summarize data more quickly.
I'm dating myself here, but when I was a kid, if I wanted to research something, I had to go to the library, and it would take me quite a long time to find the answer. With the advent of the internet, I didn't have to drive to the library anymore. And then when mobile came out, I didn't even have to go home to my computer. Timeframes for information access have gotten shorter and shorter, and generative AI will further compress that time.
The upshot for markets is that information will be acted upon much more quickly. This is why we think dynamic market solutions are so important.
What about generative AI raises caution flags?
Generative AI is an incredibly exciting area, but there are risks, and people need to take those risks seriously.
How are AI models built, how are they trained, and where are they hosted? At Nasdaq, we focus on having excellent AI governance, where anything that goes out is looked at from a compliance, regulatory, and technology perspective to ensure it's safe, secure, and scalable for clients. No AI solution goes out without a rigorous governance process to ensure that the models are effective and can be trusted.
Explainability is also critical. When we came to market with some of our dynamic market solutions, like Dynamic M-ELO, we were very transparent; in fact, we published a whole whitepaper about it. We covered questions like what features are going into the model, how the model behaves and what type of information it will use to make decisions.
Having that transparency and explainability is essential, and it's something we value at Nasdaq.
Transparency is a core value for Nasdaq and a key part of what makes modern financial markets work so well. When it comes to generative AI at Nasdaq, transparency is equally important. Being able to explain both where information comes from and why it's relevant is critical to ensuring generative AI is trustworthy and effective. We're excited about the transformation opportunities generative AI can provide, and we look forward to continuing this journey of discovery.
Creating tomorrow's markets today. Find out more about Nasdaq's offerings to drive your business forward here.
Mosaic Data Science Named Top AI & Machine Learning Company by CIO Review – Global Banking And Finance Review
The data science consultancy is honored to have received this recognition for the second consecutive year.
Leesburg, VA, May 14, 2024: Mosaic Data Science, a trailblazer in data science consulting, has been recognized as the Top AI & Machine Learning Company by CIO Review for the second consecutive year. This prestigious accolade underscores Mosaic's commitment to delivering transformative, actionable analytics solutions that empower enterprises to harness the full potential of AI, ML, and mathematical optimization for insightful decision-making.
Where technology meets human ingenuity, transformative innovations that redefine industries are born, said Chris Brinton, CEO of Mosaic Data Science.
This statement encapsulates the core philosophy of Mosaic, a premier artificial intelligence company devoted to equipping enterprises with robust, actionable analytics solutions. Mosaic's team of data scientists is renowned for its deep domain expertise, positioning the company as a leader in superior, scalable solutions that drive significant digital transformations. Its unique blend of data engineering, statistical analytics, and problem-solving prowess ensures that strategic business visions and digital aspirations are realized. Through a well-established customer engagement model, Mosaic has carved a niche for itself, consistently delivering a competitive advantage in our tech-driven era.
We transcend typical AI/ML boundaries by delivering solutions rooted in real-world applications that foster growth, enhance operations, and produce measurable outcomes, highlights Brinton. Our mission is to simplify data science, making these powerful technologies accessible and effective for both burgeoning startups and established enterprises looking to expand their capabilities.
Mosaic champions adopting bespoke AI/ML tools tailored to the nuances of client needs, be it enhancing existing teams or managing entire project lifecycles. Their offerings have evolved to include innovative solutions such as the Neural Search Engine, Mosaic.deploy, and Data-Driven Decision Dynamics.
Named the top insight engine of 2024 by CIO Review, our Neural Search engine transcends traditional text-matching limitations, said Mike Shumpert, VP of Data Science. As businesses increasingly embrace GenAI and Large Language Models (LLMs), the strategic advantage lies not just in using these technologies but in expertly tuning them to specific needs. With our pioneering work in Reader/Retrieval Architectures, Neural Search helps unlock significant business value and empower clients with actionable insights for strategic decision-making.
These services and tools ensure that clients find optimal engagement paths tailored to their specific needs, thereby maintaining pace with technological advances and securing industry leadership. Mosaics solutions are particularly noted for integrating custom predictive and prescriptive analytics that seamlessly align with client data and business workflows, enhancing strategic outcomes with precision and expertise.
Mosaic.deploy enables clients to deploy AI/ML models efficiently while remaining sustainable and ethical, said Drew Clancy, VP of Sales & Marketing. This ensures long-term success, with an eye toward achieving explainable AI to help customers leverage these technologies for a sustainable future.
Mosaic's expertise extends beyond implementing AI/ML solutions; it guides organizations throughout the adoption lifecycle. From scoping and executing AI roadmaps to building a sustainable MLOps pipeline, it offers guidance that ensures seamless integration and impactful results.
Each project requires custom AI/ML tuning to guarantee ideal outcomes, said Chris Provan, Managing Director of Data Science. Our methodologies, designed to make data science techniques understandable and actionable, and our expertise in the niche lead to the best outcomes, transforming challenges into growth opportunities.
Mosaic's dedication to ethical AI practices is further demonstrated through its partnership with Epstein Becker Green. Mosaic offers explainable AI and bias auditing services to help de-risk clients' AI plans, ensuring that crucial information is ethical and reliable for better decision-making and compliance with industry standards. This partnership is ideal for evaluating AI lifecycles for potential risks, improving governance, and offering holistic solutions for fair AI decisions.
Over the last decade, Mosaic has achieved over 350 successful AI/ML deployments and participated in countless analytics and optimization projects across dozens of industries, proving that they aren't simply participating in the AI revolution; they are leading it. This recognition as the top AI & ML company of 2024 by CIO Review affirms Mosaic's role as a key innovator in the AI and machine learning space, continually pushing the envelope in analytics development and reinforcing its leadership in continuous innovation.
About Mosaic Data Science
Mosaic Data Science is a leading technology and business innovator known for its expertise in AI and ML. With services ranging from Neural Search Engines to AI/ML roadmaps, Mosaic excels in crafting cutting-edge solutions that propel client business goals. It is the Top AI & Machine Learning Company of 2024.
About CIO Review
CIO Review is a leading technology magazine that bridges the gap between enterprise IT vendors and buyers. As a knowledge network, CIO Review offers a range of in-depth CIO/CXO articles, whitepapers, and research studies on the latest trends in technology.
Aggregating Real-time Sensor Data with Python and Redpanda – Towards Data Science
In this tutorial, I want to show you how to downsample a stream of sensor data using only Python (and Redpanda as a message broker). The goal is to show you how simple stream processing can be, and that you don't need a heavy-duty stream processing framework to get started.
Until recently, stream processing was a complex task that usually required some Java expertise. But gradually, the Python stream processing ecosystem has matured, and there are a few more options available to Python developers, such as Faust, Bytewax, and Quix. Later, I'll provide a bit more background on why these libraries have emerged to compete with the existing Java-centric options.
But first, let's get to the task at hand. We will use a Python library called Quix Streams as our stream processor. Quix Streams is very similar to Faust, but it has been optimized to be more concise in its syntax and uses a Pandas-like API called StreamingDataFrames.
You can install the Quix Streams library with the following command:
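Assuming Python 3 and pip are available:

```bash
pip install quixstreams
```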
What you'll build
You'll build a simple application that will calculate the rolling aggregations of temperature readings coming from various sensors. The temperature readings will come in at a relatively high frequency, and this application will aggregate the readings and output them at a lower time resolution (every 10 seconds). You can think of this as a form of compression, since we don't want to work on data at an unnecessarily high resolution.
You can access the complete code in this GitHub repository.
This application includes code that generates synthetic sensor data, but in a real-world scenario this data could come from many kinds of sensors, such as sensors installed in a fleet of vehicles or a warehouse full of machines.
Here's an illustration of the basic architecture:
The previous diagram reflects the main components of a stream processing pipeline: You have the sensors which are the data producers, Redpanda as the streaming data platform, and Quix as the stream processor.
Data producers
These are bits of code that are attached to systems that generate data such as firmware on ECUs (Engine Control Units), monitoring modules for cloud platforms, or web servers that log user activity. They take that raw data and send it to the streaming data platform in a format that that platform can understand.
Streaming data platform
This is where you put your streaming data. It plays more or less the same role as a database does for static data. But instead of tables, you use topics. Otherwise, it has similar features to a static database. You'll want to manage who can consume and produce data and what schemas the data should adhere to. Unlike a database, though, the data is constantly in flux, so it's not designed to be queried. You'd usually use a stream processor to transform the data and put it somewhere else for data scientists to explore, or sink the raw data into a queryable system optimized for streaming data such as RisingWave or Apache Pinot. However, for automated systems that are triggered by patterns in streaming data (such as recommendation engines), this isn't an ideal solution. In this case, you definitely want to use a dedicated stream processor.
Stream processors
These are engines that perform continuous operations on the data as it arrives. They could be compared to regular old microservices that process data in any application back end, but there's one big difference. For microservices, data arrives in drips like droplets of rain, and each drip is processed discretely. Even if it rains heavily, it's not too hard for the service to keep up with the drops without overflowing (think of a filtration system that filters out impurities in the water).
For a stream processor, the data arrives as a continuous, wide gush of water. A filtration system would be quickly overwhelmed unless you change the design, i.e., break the stream up and route smaller streams to a battery of filtration systems. That's kind of how stream processors work. They're designed to be horizontally scaled and work in parallel as a battery. And they never stop; they process the data continuously, outputting the filtered data to the streaming data platform, which acts as a kind of reservoir for streaming data. To make things more complicated, stream processors often need to keep track of data that was received previously, such as in the windowing example you'll try out here.
Note that there are also data consumers and data sinks: systems that consume the processed data (such as front-end applications and mobile apps) or store it for offline analysis (data warehouses like Snowflake or AWS Redshift). Since we won't be covering those in this tutorial, I'll skip over them for now.
In this tutorial, I'll show you how to use a local installation of Redpanda for managing your streaming data. I've chosen Redpanda because it's very easy to run locally.
You'll use Docker Compose to quickly spin up a cluster, including the Redpanda console, so make sure you have Docker installed first.
First, you'll create separate files to produce and process your streaming data. This makes it easier to manage the running processes independently, i.e., you can stop the producer without stopping the stream processor too. Here's an overview of the two files that you'll create:
As you can see the stream processor does most of the heavy lifting and is the core of this tutorial. The stream producer is a stand-in for a proper data ingestion process. For example, in a production scenario, you might use something like this MQTT connector to get data from your sensors and produce it to a topic.
You'll start by creating a new file called sensor_stream_producer.py and define the main Quix application. (This example has been developed on Python 3.10, but different versions of Python 3 should work as well, as long as you are able to run pip install quixstreams.)
Create the file sensor_stream_producer.py and add all the required dependencies (including Quix Streams)
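A minimal sketch of those imports might look like this (the exact set depends on your Quix Streams version; the random and time helpers below are assumptions used later for generating mock data):

```python
import json
import random
import time
from dataclasses import dataclass, asdict

from quixstreams import Application
```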
Then, define a Quix application and destination topic to send the data.
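For example, assuming Redpanda's Kafka API is exposed on localhost:19092 (a common port in Docker Compose setups; adjust to match yours):

```python
# Connect to the local Redpanda broker
app = Application(broker_address="localhost:19092")

# The source topic that will receive the raw sensor readings
destination_topic = app.topic(name="raw-temp-data", value_serializer="json")
```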
The value_serializer parameter defines the format of the expected source data (to be serialized into bytes). In this case, you'll be sending JSON.
Let's use the dataclass module to define a very basic schema for the temperature data and add a function to serialize it to JSON.
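A sketch of such a schema; the field names here are assumptions, not necessarily the original article's exact schema:

```python
@dataclass
class Temperature:
    timestamp: int   # epoch milliseconds when the reading was generated
    value: float     # temperature reading
    sensor: str      # name of the sensor that produced the reading

    def to_json(self) -> str:
        # Serialize the dataclass into a JSON string for the producer
        return json.dumps(asdict(self))
```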
Next, add the code that will be responsible for sending the mock temperature sensor data into our Redpanda source topic.
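A sketch of that producer loop, assuming the Temperature dataclass and topic defined above (the value range and sensor names are illustrative):

```python
def main():
    sensor_names = [f"Sensor-{i}" for i in range(1, 6)]  # 5 mock sensors

    with app.get_producer() as producer:
        for _ in range(1000):
            reading = Temperature(
                timestamp=int(time.time() * 1000),
                value=round(random.uniform(18.0, 30.0), 2),
                sensor=random.choice(sensor_names),
            )
            producer.produce(
                topic=destination_topic.name,
                key=reading.sensor,  # keep each sensor's readings on one partition
                value=reading.to_json(),
            )
            print(f"Produced: {reading}")
            time.sleep(random.uniform(0, 1))  # random gap between 0 and 1 second


if __name__ == "__main__":
    main()
```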
This generates 1000 records separated by random time intervals between 0 and 1 second. It also randomly selects a sensor name from a list of 5 options.
Now, try out the producer by running the following in the command line
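Assuming the file name used above:

```bash
python sensor_stream_producer.py
```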
You should see data being logged to the console like this:
Once you've confirmed that it works, stop the process for now (you'll run it alongside the stream processing process later).
The stream processor performs three main tasks: 1) consume the raw temperature readings from the source topic, 2) continuously aggregate the data, and 3) produce the aggregated results to a sink topic.
Let's add the code for each of these tasks. In your IDE, create a new file called sensor_stream_processor.py.
First, add the dependencies as before:
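For the processor, a minimal set of imports and logging setup might look like this:

```python
import logging
from datetime import timedelta

from quixstreams import Application

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
```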
Let's also set some variables that our stream processing application needs:
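For example (the broker address, consumer group, and topic names are assumptions that just need to match what you used in the producer):

```python
# Redpanda connection and consumer settings
BROKER_ADDRESS = "localhost:19092"
CONSUMER_GROUP = "temperature-aggregator"
INPUT_TOPIC = "raw-temp-data"
OUTPUT_TOPIC = "agg-temperature"

# Windowing settings: aggregate readings into 10-second tumbling windows
WINDOW_DURATION = timedelta(seconds=10)
```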
We'll go into more detail on what the window variables mean a bit later, but for now, let's crack on with defining the main Quix application.
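A sketch of the application definition, using the variables above:

```python
app = Application(
    broker_address=BROKER_ADDRESS,
    consumer_group=CONSUMER_GROUP,
    auto_offset_reset="earliest",  # read from the start of the topic if no offset is stored
)
```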
Note that there are a few more application variables this time around, namely consumer_group and auto_offset_reset. To learn more about the interplay between these settings, check out the article "Understanding Kafka's auto offset reset configuration: Use cases and pitfalls."
Next, define the input and output topics on either side of the core stream processing function and add a function to put the incoming data into a DataFrame.
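Something along these lines; the JSON deserializer mirrors the serializer used on the producer side:

```python
input_topic = app.topic(INPUT_TOPIC, value_deserializer="json")
output_topic = app.topic(OUTPUT_TOPIC, value_serializer="json")

# Create a StreamingDataFrame that consumes from the input topic
sdf = app.dataframe(topic=input_topic)

# Log each incoming record so we can verify the data is intact
sdf = sdf.update(lambda value: logger.info(f"Received: {value}"))
```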
We've also added a logging line to make sure the incoming data is intact.
Next, let's add a custom timestamp extractor to use the timestamp from the message payload instead of the Kafka timestamp. For your aggregations, this basically means that you want to use the time that the reading was generated rather than the time that it was received by Redpanda. Or, in even simpler terms: use the sensor's definition of time rather than Redpanda's.
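One way to do this in Quix Streams is to pass a timestamp_extractor when defining the input topic. The sketch below assumes the payload carries the epoch-millisecond timestamp field from the producer sketch, and it replaces the earlier input_topic definition (in the final script, define it before creating the StreamingDataFrame):

```python
def custom_ts_extractor(value, headers, timestamp, timestamp_type) -> int:
    # Use the time the reading was generated (carried in the payload)
    # rather than the time Redpanda received the message.
    return value["timestamp"]


input_topic = app.topic(
    INPUT_TOPIC,
    value_deserializer="json",
    timestamp_extractor=custom_ts_extractor,
)
```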
Why are we doing this? Well, we could get into a philosophical rabbit hole about which kind of time to use for processing, but that's a subject for another article. With the custom timestamp, I just wanted to illustrate that there are many ways to interpret time in stream processing, and you don't necessarily have to use the time of data arrival.
Next, initialize the state for the aggregation when a new window starts. It will prime the aggregation when the first record arrives in the window.
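A sketch of such an initializer, keyed on the temperature field from the producer sketch:

```python
def initializer(value: dict) -> dict:
    # Called once per window, with the first record that falls into it.
    return {
        "count": 1,
        "min": value["value"],
        "max": value["value"],
        "mean": value["value"],
    }
```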
This sets the initial values for the window. In the case of min, max, and mean, they are all identical because you're just taking the first sensor reading as the starting point.
Now, let's add the aggregation logic in the form of a reducer function.
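A sketch of the reducer, which folds each new reading into the running aggregates:

```python
def reducer(aggregated: dict, value: dict) -> dict:
    # Called for every record after the first one in a window.
    new_count = aggregated["count"] + 1
    old_sum = aggregated["mean"] * aggregated["count"]
    return {
        "count": new_count,
        "min": min(aggregated["min"], value["value"]),
        "max": max(aggregated["max"], value["value"]),
        "mean": (old_sum + value["value"]) / new_count,
    }
```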
This function is only necessary when you're performing multiple aggregations on a window. In our case, we're creating count, min, max, and mean values for each window, so we need to define these in advance.
Next up, the juicy part: adding the tumbling window functionality.
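In Quix Streams this looks roughly like the following; .final() is assumed to emit one result per window once that window closes, based on the library's windowing API at the time of writing:

```python
sdf = (
    sdf.tumbling_window(WINDOW_DURATION)               # 10-second, non-overlapping windows
    .reduce(reducer=reducer, initializer=initializer)  # apply our aggregation logic
    .final()                                           # emit once per closed window
)
```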
This defines the Streaming DataFrame as a set of aggregations based on a tumbling window: a set of aggregations performed on 10-second, non-overlapping segments of time.
Tip: If you need a refresher on the different types of windowed calculations, check out this article: A guide to windowing in stream processing.
Finally, produce the results to the downstream output topic:
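A minimal sketch:

```python
sdf = sdf.to_topic(output_topic)  # stream the aggregated rows to the sink topic

if __name__ == "__main__":
    app.run(sdf)
```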
Note: You might wonder why the producer code looks very different from the producer code used to send the synthetic temperature data (the part that uses "with app.get_producer() as producer"). This is because Quix uses a different producer function for transformation tasks (i.e., a task that sits between input and output topics).
As you might notice when following along, we iteratively change the Streaming DataFrame (the sdf variable) until it is the final form that we want to send downstream. Thus, the sdf.to_topic function simply streams the final state of the Streaming DataFrame back to the output topic, row-by-row.
The producer function on the other hand, is used to ingest data from an external source such as a CSV file, an MQTT broker, or in our case, a generator function.
Finally, you get to run our streaming applications and see if all the moving parts work in harmony.
First, in one terminal window, start the producer again, and then, in a second terminal window, start the stream processor:
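Assuming the file names used earlier:

```bash
# terminal 1
python sensor_stream_producer.py

# terminal 2
python sensor_stream_processor.py
```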
Pay attention to the log output in each window, to make sure everything is running smoothly.
You can also check the Redpanda console to make sure that the aggregated data is being streamed to the sink topic correctly (you'll find the topic browser at http://localhost:8080/topics).
What you've tried out here is just one way to do stream processing. Naturally, there are heavy-duty tools such as Apache Flink and Apache Spark Streaming, which have also been covered extensively online. But those are predominantly Java-based tools. Sure, you can use their Python wrappers, but when things go wrong, you'll still be debugging Java errors rather than Python errors. And Java skills aren't exactly ubiquitous among data folks, who are increasingly working alongside software engineers to tune stream processing algorithms.
In this tutorial, we ran a simple aggregation as our stream processing algorithm, but in reality, these algorithms often employ machine learning models to transform that data, and the software ecosystem for machine learning is heavily dominated by Python.
An oft-overlooked fact is that Python is the lingua franca for data specialists, ML engineers, and software engineers to work together. It's even better than SQL because you can use it to do non-data-related things like make API calls and trigger webhooks. That's one of the reasons why libraries like Faust, Bytewax, and Quix evolved to bridge the so-called impedance gap between these different disciplines.
Hopefully, I've managed to show you that Python is a viable language for stream processing, and that the Python ecosystem for stream processing is maturing at a steady rate and can hold its own against the older Java-based ecosystem.
Are GPTs Good Embedding Models? A surprising experiment to show that | by Yu-Cheng Tsai | May 2024 – Towards Data Science
Image by the author using DALL-E
With the growing number of embedding models available, choosing the right one for your machine learning applications can be challenging. Fortunately, the MTEB leaderboard provides a comprehensive range of ranking metrics for various natural language processing tasks.
When you visit the site, you'll notice that the top five embedding models are Generative Pre-trained Transformers (GPTs). This might lead you to think that GPT models are the best for embeddings. But is this really true? Let's conduct an experiment to find out.
Embeddings are tensor representations of text that convert text token IDs and project them into a tensor space.
By inputting text into a neural network model and performing a forward pass, you can obtain embedding vectors. However, the actual process is a bit more complex. Let's break it down step by step:
In the first step, I am going to use a tokenizer to achieve this. model_inputs is the tensor representation of the text content, "some questions."
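A sketch with Hugging Face transformers, assuming the Mistral 7B Instruct v0.1 checkpoint that this article uses later:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Step 1: tokenize the text into tensor inputs (token IDs and attention mask)
model_inputs = tokenizer("some questions.", return_tensors="pt")
```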
The second step is straightforward: forward-passing the model_inputs into a neural network. The logits of the generated tokens can be accessed via .logits. torch.no_grad() means I don't want the model weights to be updated, because the model is in inference mode.
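For example:

```python
# Step 2: forward pass in inference mode; no gradients are tracked, no weights updated
with torch.no_grad():
    outputs = model(**model_inputs)

logits = outputs.logits  # shape: (batch_size, input_token_size, vocab_size)
```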
The third step is a bit tricky. GPT models are decoder-only, and their token generation is autoregressive. In simple terms, the last token of a completed sentence has seen all the preceding tokens in the sentence. Therefore, the output of the last token contains all the affinity scores (attentions) from the preceding tokens.
Bingo! You are most interested in the last token because of the attention mechanism in the transformers.
The output dimension of the GPTs implemented in Hugging Face is (batch size, input token size, number of vocabulary). To get the last token output of all the batches, I can perform a tensor slice.
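The slice below keeps only the last position along the token dimension:

```python
# Step 3: take the output of the last token for every item in the batch
embedding = logits[:, -1, :]  # shape: (batch_size, vocab_size)
```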
To measure the quality of these GPT embeddings, you can use cosine similarity. The higher the cosine similarity, the closer the semantic meaning of the sentences.
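A small helper, using PyTorch's built-in cosine similarity:

```python
import torch.nn.functional as F


def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Returns a value in [-1, 1]; higher means the embeddings are semantically closer
    return F.cosine_similarity(a, b, dim=-1)
```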
Let's create some utility functions that allow us to loop through a list of question-and-answer pairs and see the result. Mistral 7B Instruct v0.1, one of the great open-source models, is used for this experiment.
For the first question and answer pair:
For the second question and answer pair:
For an irrelevant pair:
For the worst pair:
These results suggest that using GPT models, in this case, the mistral 7b instruct v0.1, as embedding models may not yield great results in terms of distinguishing between relevant and irrelevant pairs. But why are GPT models still among the top 5 embedding models?
Repeating the same evaluation procedure with a different model, e5-mistral-7b-instruct, which is one of the top open-source models on the MTEB leaderboard and fine-tuned from Mistral 7B Instruct, I discover that the cosine similarity for the relevant question-and-answer pairs is 0.88 and 0.84 for the OpenAI and GPU questions, respectively. For the irrelevant question-and-answer pairs, the similarity drops to 0.56 and 0.67. These findings suggest e5-mistral-7b-instruct is a much-improved model for embeddings. What makes such an improvement?
Delving into the paper behind e5-mistral-7b-instruct, the key is the use of contrastive loss to further fine-tune the Mistral model.
Unlike GPTs that are trained or further fine-tuned using cross-entropy loss of predicted tokens and labeled tokens, contrastive loss aims to maximize the distance between negative pairs and minimize the distance between the positive pairs.
This blog post covers this concept in greater detail. The sim function calculates the cosine distance between two vectors. For contrastive loss, the denominator aggregates the similarities to both the positive and the negative examples. The rationale behind contrastive loss is that we want similar vectors to be as close to 1 as possible, since log(1) = 0 represents the optimal loss.
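As a rough sketch (my notation, not necessarily the paper's), an InfoNCE-style contrastive objective looks like:

$$
\mathcal{L} = -\log
\frac{\exp\!\big(\mathrm{sim}(q, p^{+})/\tau\big)}
     {\exp\!\big(\mathrm{sim}(q, p^{+})/\tau\big) + \sum_{i} \exp\!\big(\mathrm{sim}(q, n_{i})/\tau\big)}
$$

where q is the query embedding, p+ a positive (relevant) example, n_i the negative examples, sim the cosine similarity, and τ a temperature parameter.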
In this post, I have highlighted a common pitfall of using GPTs as embedding models without fine-tuning. My evaluation suggests that when GPTs are fine-tuned with contrastive loss, the embeddings become more meaningful and discriminative. By understanding the strengths and limitations of GPT models, and leveraging customized losses like contrastive loss, you can make more informed decisions when selecting and utilizing embedding models for your machine learning projects. I hope this post helps you choose GPT models wisely for your applications, and I look forward to hearing your feedback! 🙂
Prompt Like a Data Scientist: Auto Prompt Optimization and Testing with DSPy – Towards Data Science
We will spend some time to go over the environment preparation. Afterwards, this article is divided into 3 sections:
We are now ready to start!
They are the building blocks of prompt programming in DSPy. Let's dive in to see what they are about!
A signature is the most fundamental building block in DSPy's prompt programming: a declarative specification of the input/output behavior of a DSPy module. Signatures allow you to tell the LM what it needs to do, rather than specify how we should ask the LM to do it.
Say we want to obtain the sentiment of a sentence. Traditionally, we might write a prompt like this:
But in DSPy, we can achieve the same by defining a signature as below. At its most basic form, a signature is as simple as a single string separating the inputs and output with a ->
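A minimal sketch (the LM configuration assumes an OpenAI model and the DSPy API as of mid-2024):

```python
import dspy

# Configure the language model DSPy will use behind the scenes
turbo = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=turbo)

# Inline signature: input field "sentence", output field "sentiment"
classify = dspy.Predict("sentence -> sentiment")
pred = classify(sentence="This movie was a complete waste of time.")
print(pred.sentiment)
```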
Note: Code in this section is adapted from DSPy's documentation of Signatures.
The prediction is not a good one, but for instructional purposes let's inspect the prompt that was issued.
We can see the above prompt is assembled from the sentence -> sentiment signature. But how did DSPy come up with the "Given the fields ..." instruction in the prompt?
Inspecting the dspy.Predict() class, we see that when we pass it our signature, the signature is parsed into the signature attribute of the class and subsequently assembled into a prompt. The instructions text is a default one hardcoded in the DSPy library.
What if we want to provide a more detailed description of our objective to the LLM, beyond the basic sentence -> sentiment signature? To do so, we need to provide a more verbose signature in the form of a class-based DSPy Signature.
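A sketch of a class-based signature for the same task (the field descriptions are illustrative):

```python
class Emotion(dspy.Signature):
    """Classify the sentiment of a given sentence."""

    sentence = dspy.InputField(desc="The sentence to classify.")
    sentiment = dspy.OutputField(desc="One of: positive, negative, neutral.")


classify = dspy.Predict(Emotion)
pred = classify(sentence="This movie was a complete waste of time.")
print(pred.sentiment)
```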
Notice we provide no explicit instruction as to how the LLM should obtain the sentiment. We are just describing the task at hand, and also the expected output.
It is now outputting a much better prediction! Again we see the descriptions we made when defining the class-based DSPy signatures are assembled into a prompt.
This might do for simple tasks, but advanced applications might require sophisticated prompting techniques like Chain of Thought or ReAct. In DSPy these are implemented as Modules.
We may be used to applying prompting techniques by hardcoding phrases like "let's think step by step" in our prompt. In DSPy these prompting techniques are abstracted as Modules. Let's see below an example of applying our class-based signature to the dspy.ChainOfThought module.
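For instance:

```python
classify_cot = dspy.ChainOfThought(Emotion)
pred = classify_cot(sentence="This movie was a complete waste of time.")
print(pred.rationale)   # the intermediate step-by-step reasoning added by the module
print(pred.sentiment)
```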
Notice how the "Reasoning: Let's think step by step" phrase is added to our prompt, and the quality of our prediction is even better now.
According to DSPy's documentation, as of the time of writing DSPy provides the following prompting techniques in the form of Modules. Notice that the dspy.Predict we used in the initial example is also a Module, representing no prompting technique!
It also has some function-style modules:
6. dspy.majority: Can do basic voting to return the most popular response from a set of predictions.
You can check out further examples in each module's respective guide.
On the other hand, what about RAG? We can chain the modules together to deal with bigger problems!
First we define a retriever, for our example we use a ColBERT retriever getting information from Wikipedia Abstracts 2017
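A sketch, using the public demo endpoint referenced in DSPy's intro notebook (assumption: the endpoint is still available):

```python
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")
dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)
```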
Then we define the RAG class, inherited from dspy.Module. It needs two methods: an __init__ method that declares the sub-modules it uses, and a forward method that describes the control flow for answering a question with those modules.
Note: Code in this section is borrowed from DSPy's introduction notebook.
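The class looks roughly like this:

```python
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        # Sub-modules: a retriever and a Chain-of-Thought answer generator
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Control flow: retrieve passages, then generate an answer grounded in them
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
```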
Then we make use of the class to perform a RAG
Inspecting the prompt, we see that the 3 passages retrieved from Wikipedia Abstracts 2017 are interspersed as context for Chain of Thought generation.
The above examples might not seem like much. At its most basic application, DSPy might seem to do nothing that can't be done with an f-string, but it actually presents a paradigm shift for prompt writing, as it brings modularity to prompt composition!
First we describe our objective with a Signature, then we apply different prompting techniques with Modules. To test different prompting techniques for a given problem, we can simply switch the modules used and compare their results, rather than hardcoding the "let's think step by step" (for Chain of Thought) or "you will interleave Thought, Action, and Observation steps" (for ReAct) phrases. The benefit of modularity will be demonstrated later in this article with a full-fledged example.
The power of DSPy is not limited to modularity; it can also optimize our prompt based on training samples and test it systematically. We will be exploring this in the next section!
In this section we try to optimize our prompt for a RAG application with DSPy.
Taking Chain of Thought as an example, beyond just adding the "let's think step by step" phrase, we can boost its performance with a few tweaks:
Doing this manually would be highly time-consuming and can't generalize to different problems, but with DSPy this can be done automatically. Let's dive in!
#1: Loading test data: Like machine learning, to train our prompt we need to prepare our training and test datasets. Initially this cell will take around 20 minutes to run.
Inspecting our dataset, which is basically a set of question-and-answer pairs
#2 Set up Phoenix for observability: To facilitate understanding of the optimization process, we launch Phoenix to observe our DSPy application, which is a great tool for LLM observability in general! I will skip pasting the code here, but you can execute it in the notebook.
Note: If you are on Windows, please also install Windows C++ Build Tools here, which is necessary for Phoenix
Then we are ready to see what this optimization is about! To train our prompt, we need 3 things: a training set, a metric for judging answers, and an optimizer (called a teleprompter in DSPy).
Now we train our prompt.
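A sketch of the training (compilation) step, using exact match plus passage match as the metric and BootstrapFewShot as the optimizer; the max_bootstrapped_demos value matches the behavior described below:

```python
from dspy.teleprompt import BootstrapFewShot


def validate_context_and_answer(example, pred, trace=None):
    # The predicted answer must exactly match the gold answer...
    answer_em = dspy.evaluate.answer_exact_match(example, pred)
    # ...and must also appear in one of the retrieved passages.
    answer_pm = dspy.evaluate.answer_passage_match(example, pred)
    return answer_em and answer_pm


teleprompter = BootstrapFewShot(metric=validate_context_and_answer, max_bootstrapped_demos=4)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)
```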
Before using the compiled_rag to answer a question, let's see what went on behind the scenes during the training process (aka compilation). We launch the Phoenix console by visiting http://localhost:6006/ in a browser.
In my run, I made 14 calls using the RAG class; in each of those calls, we post a question to the LM to obtain a prediction.
Referring to the result summary table in my notebook, 4 correct answers were made from these 14 samples, thus reaching our max_bootstrapped_demos parameter and stopping the calls.
But what are the prompts DSPy issued to obtain the bootstrapped demos? Here's the prompt for question #14. We can see that as DSPy tries to generate one bootstrapped demo, it randomly adds samples from our trainset for few-shot learning.
Time to put the compiled_rag to the test! Here we raise a question that was answered wrongly in our summary table, and see if we can get the right answer this time.
We now get the right answer!
Again, let's inspect the prompt issued. Notice how the compiled prompt is different from the ones that were used during bootstrapping. Apart from the few-shot examples, bootstrapped Context-Question-Reasoning-Answer demonstrations from correct predictions are added to the prompt, improving the LM's capability.
So that is basically what goes on behind the scenes with BootstrapFewShot during compilation.
The above example still falls short of what we typically do with machine learning: even if bootstrapping may be useful, we have not yet proven that it improves the quality of the responses.
Ideally, as in traditional machine learning, we should define a couple of candidate models, see how they perform against the test set, and select the one achieving the highest performance score. This is what we will do next!
In this section, we want to evaluate what is the best prompt (expressed in terms of module and optimizer combination) to perform a RAG against the HotpotQA dataset (distributed under a CC BY-SA 4.0 License), given the LM we use (GPT 3.5 Turbo).
The Modules under evaluation are:
And the Optimizer candidates are:
As for the evaluation metric, we again use exact match (dspy.evaluate.metrics.answer_exact_match) as the criterion against the test set.
Let's begin! First, we define our modules.
Then we define the permutations of our model candidates.
Then I defined a helper class to facilitate the evaluation. The code is a bit long, so I am not pasting it here, but it can be found in my notebook. What it does is apply each of the optimizers to each of the modules, compile the prompt, and then evaluate it against the test set; a rough sketch of that loop is shown below.
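The core of such a helper might look something like the following sketch; the `modules` and `optimizers` dictionaries, `testset`, and `trainset` are hypothetical placeholders, since the author's actual class lives in the notebook.

```python
from dspy.evaluate import Evaluate
from dspy.evaluate.metrics import answer_exact_match

# Score a compiled program on the test set by exact match.
evaluate = Evaluate(devset=testset, metric=answer_exact_match,
                    num_threads=4, display_progress=True)

scores = {}
for module_name, make_module in modules.items():      # hypothetical dict of module factories
    for opt_name, optimizer in optimizers.items():     # hypothetical dict; None = no optimizer
        program = make_module()
        if optimizer is not None:
            program = optimizer.compile(program, trainset=trainset)
        scores[(module_name, opt_name)] = evaluate(program)
```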
We are now ready to start the evaluation; it takes around 20 minutes to complete.
Here's the evaluation result. We can see that the COT module with the BootstrapFewShot optimizer has the best performance. The scores represent the percentage of correct answers (judged by exact match) on the test set.
But before we conclude the exercise, it might be useful to inspect the result more deeply: Multihop with BootstrapFewShot, which is supposedly equipped with more relevant context than COT with BootstrapFewShot, performs worse. That is strange!
Now head to the Phoenix console to see what's going on. We pick a random question, "William Hughes Miller was born in a city with how many inhabitants?", and inspect how COT, ReAct, and BasicMultiHop with the BootstrapFewShot optimizer came up with their answers. You can type this in the search bar as a filter: """William Hughes Miller was born in a city with how many inhabitants ?""" in input.value
These are the answers provided by the 3 models during my run:
The correct answer is 7,402 at the 2010 census. Both ReAct with BootstrapFewShot and COT with BootstrapFewShot provided relevant answers, but Multihop with BootstrapFewShot simply failed to provide one.
Checking the execution trace in Phoenix for Multihop with BootstrapFewShot, it looks like the LM fails to understand what is expected for the search_query field specified in the signature.
So we revise the signature and re-run the evaluation with the code below.
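The author's exact revision is in the notebook; a hypothetical sketch of the kind of change involved is to spell out, in the signature's field descriptions, what search_query should look like.

```python
import dspy

class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    # The more explicit description plays the role of the "revision":
    # it tells the LM to emit a short, standalone query rather than prose.
    search_query = dspy.OutputField(desc="a short, self-contained search query for the retriever")
```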
We now see the score improved across all models, and Multihop with LabeledFewShot and Multihop with no examples now have the best performance! This indicates that even though DSPy tries to optimize the prompt, some prompt engineering is still involved in articulating your objective in the signature.
The best model now produces an exact match for our question!
Since the best prompt is Multihop with LabeledFewShot, it contains no bootstrapped Context-Question-Reasoning-Answer demonstrations. So bootstrapping does not necessarily lead to better performance; we need to prove which prompt is best empirically.
This does not mean Multihop with BootstrapFewShot performs worse in general, however. It only means that for our task, if we use GPT-3.5 Turbo both to bootstrap demonstrations (which might be of questionable quality) and to output predictions, we may be better off without the bootstrapping and keeping only the few-shot examples.
This leads to the question: is it possible to use a more powerful LM, say GPT-4 Turbo (the teacher), to generate demonstrations, while keeping a cheaper model like GPT-3.5 Turbo (the student) for prediction?
The answer is YES, as the following cell demonstrates: we use GPT-4 Turbo as the teacher, along the lines of the sketch below.
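A sketch of that teacher/student setup, assuming the same RAG class, trainset, and validate_answer metric as before; the GPT-4 Turbo model name shown here is an assumption.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

student_lm = dspy.OpenAI(model="gpt-3.5-turbo")        # answers at inference time
teacher_lm = dspy.OpenAI(model="gpt-4-turbo-preview")  # only generates the demonstrations
dspy.settings.configure(lm=student_lm)

teleprompter = BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=4,
    teacher_settings=dict(lm=teacher_lm),  # the teacher is used only during compilation
)
compiled_rag_gpt4_teacher = teleprompter.compile(RAG(), trainset=trainset)
```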
Using GPT-4 Turbo as the teacher does not significantly boost our model's performance, however. Still, it is worthwhile to see its effect on our prompt. Below is the prompt generated using only GPT-3.5.
And here's the prompt generated using GPT-4 Turbo as the teacher. Notice how the Reasoning is much better articulated here!
Currently, we often rely on manual prompt engineering, at best abstracted as f-strings. For LM comparisons, we often raise underspecified questions like "how do different LMs compare on a certain problem", to borrow the Stanford NLP paper's phrasing.
But as the above examples demonstrate, with DSPy's modular, composable programs and optimizers, we are now equipped to answer "how they compare on a certain problem with Module X when compiled with Optimizer Y", which is a well-defined and reproducible run, thus reducing the role of artful prompt construction in modern AI.
That's it! I hope you enjoyed this article.
*Unless otherwise noted, all images are by the author
Read this article:
Prompt Like a Data Scientist: Auto Prompt Optimization and Testing with DSPy - Towards Data Science
Plotting Golf Courses in R with Google Earth – Towards Data Science
Once we've finished mapping our hole or course, it is time to export all that hard work into a KML file. This can be done by clicking the three vertical dots on the left side of the screen where your project resides. This project works best with geoJSON data, which we can easily convert our KML file to in the next steps. Now we're ready to head to R.
The packages we will need to prepare us for plotting are: sf (for working with geospatial data), tidyverse (for data cleaning and plotting), stringr (for string matching), and geojsonsf (for converting from KML to geoJSON). Our first step is reading in the KML file, which can be done with the st_read() function from sf.
Great! Now we should have our golf course KML data in R. The data frame should have 2 columns: Name (the project name, or course name in our case) and geometry (a list of all individual points comprising the polygons we traced). As briefly mentioned earlier, let's convert our KML data to geoJSON and also extract the course name and hole numbers.
To get our maps to point due north we need to project them in a way that preserves direction. We can do this with the st_transform() function.
We're almost ready to plot, but first we need to tell ggplot2 how each polygon should be colored. Below is the color palette my project is using, but feel free to customize it as you wish.
Optional: in this step we can also calculate the centroids of our polygons with the st_centroid() function so we can overlay the hole number onto each green.
We're officially ready to plot. We can use a combination of geom_sf(), geom_text(), and even geom_point() if we want to get fancy and plot shots on top of our map. I typically remove gridlines, axis labels, and the legend for a cleaner look.
And there you have it: a golf course plotted in R. What a concept!
To view other courses I have plotted at the time of writing this article, you can visit my Shiny app: https://abodesy14.shinyapps.io/golfMapsR/
If you followed along, had fun doing so, or are intrigued, feel free to try mapping your favorite courses and create a pull request for the golfMapsR repository that I maintain: https://github.com/abodesy14/golfMapsR. With some combined effort, we can create a nice little database of plottable golf courses around the world!
Read the original post:
Plotting Golf Courses in R with Google Earth - Towards Data Science
UNE announces new School of Computer Science and Data Analytics – University of New England
Algorithms that process complex financial data. Sensors that track and monitor endangered species. Systems that track patient health records across hospitals and the cybersecurity tools to keep them secure. Computing and data now touch nearly every facet of daily life and the world's industries, a trend that only continues to grow.
With these rapid technological advancements, the University of New England has announced the formation of a School of Computer Science and Data Analytics offering a diverse range of programs aimed at equipping students with the essential knowledge and skills needed to thrive in today's rapidly advancing technological landscape.
Embedded within UNE's College of Arts and Sciences, the new school reflects UNE's commitment to meeting the rising demand for professionals with skills in emerging technologies, such as artificial intelligence, cybersecurity, and data analytics. UNE has hired Sylvain Jaume, Ph.D., a leading artificial intelligence expert and founder of one of the nation's first data science degree programs, as the school's inaugural director.
The school will comprise UNE's existing majors in Applied Mathematics and Data Science, plus two new majors in Computer Science and Statistics, designed to ensure that the regional and national workforce is well-equipped to navigate and contribute to ongoing advancements in each industry.
"As we embark on this new venture, we are mindful of the critical role our graduates will play in shaping the future, and specializations in computer science and data science are increasingly sought after in today's job market," remarked Gwendolyn Mahon, UNE's provost and senior vice president of Academic Affairs. "The launch of this school aligns with UNE's mission to empower our graduates with the expertise required to drive innovation and address the world's complex challenges."
According to the U.S. Bureau of Labor Statistics, employment of computer scientists is projected to grow 23% through 2032, which is much faster than the average for all occupations (3%). A recent study by labor analytics firm Lightcast reported that a total of 807,000 positions seeking qualified computer science graduates were posted in 2022 alone.
The Computer Science major at UNE will build on the institution's leading reputation in interdisciplinary learning, fostering connections across the University's diverse academic and professional disciplines, including the health sciences, biology, marine science, and business, and setting students up for varied academic and research opportunities.
New courses in computer architecture, software engineering, and computational theory will prepare students for necessary jobs across a spectrum of fields including health care, financial services, biotechnology, and cybersecurity. Students will additionally gain hands-on experience through internships, providing them with real-world insights into these growing fields as well as valuable networking opportunities.
Enrollment for the new majors will begin in fall 2025.
"This transition reflects the agile nature of UNE to rethink how we educate our students to break new ground in Maine and our nation's most sought-after industries," remarked Jonathan Millen, Ph.D., dean of the College of Arts and Sciences. "These new programs exemplify UNE's dedication to innovation, excellence, and preparing future leaders to tackle the challenges of tomorrow."
See original here:
UNE announces new School of Computer Science and Data Analytics - University of New England
Dissolving map boundaries in QGIS and Python | by Himalaya Bir Shrestha | May, 2024 – Towards Data Science
In an empty QGIS project, by typing "world" in the coordinate space at the bottom of the page, I could call up a built-in map of the world with the administrative boundaries of all countries, as shown below.
Next, by using the select feature, I selected the 8 countries of South Asia as highlighted in the map below. QGIS offers the option to select countries by hand, by polygon, by radius, and by individually selecting or deselecting countries with a mouse click.
Clipping these countries off of the world map is straightforward in QGIS. One needs to go to Vector in the menu -> Geoprocessing Tools -> Clip. In the options, I ticked the checkbox for Selected features only in the input layer and ran the process.
The clipping action completed in just 7.24 seconds, and I got a new layer called Clipped, depicted in brown in the screenshot below. By going to the Properties of the layer, one can use the different coloring options QGIS offers under Symbology.
Next, I wanted to dissolve the boundaries between countries in South Asia. For this, I selected all the countries in South Asia, then went to the Vector menu -> Geoprocessing Tools -> Dissolve. As in the previous step, I selected Selected features only in the input layer and ran the algorithm, which took just 0.08 seconds. A new layer called Dissolved was created, in which the administrative boundaries between countries were dissolved so the region appears as a single unit, as shown below:
Visualizing both the world layer and Dissolved layer at the same time looks as shown below:
In this section, I am going to demonstrate how I achieved the same objective in Python using the geopandas package.
In the first step, I read the built-in dataset of the world map that ships with the geopandas package. It contains the vector data of the world with the administrative boundaries of all countries. This is obtained from the Natural Earth dataset, which is free to use.
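In geopandas versions prior to 1.0, this built-in dataset can be read roughly as follows; the column selection is just for illustration.

```python
import geopandas as gpd

# Natural Earth low-resolution world map bundled with geopandas (< 1.0).
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

print(world[["name", "continent", "pop_est", "gdp_md_est"]].head())
```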
In my very first post, I demonstrated how it is possible to clip off a custom Polygon geometry as a mask from the original geopandas dataframe or layer. However, for simplicity, I just used the filter options to obtain the required layers for Asia and South Asia.
To filter the South Asia region, I used a list containing the name of each country as a reference.
To dissolve the boundaries between countries in South Asia, I used the dissolve feature in geopandas. I passed None as the dissolve key and specified aggregate functions so that the population and GDP in the resulting dissolved dataframe sum up the population and GDP of all countries in South Asia; a sketch of both steps is shown below. I have yet to figure out how the aggregate function can also be applied in QGIS.
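A sketch of the filter and dissolve steps, assuming a country list along these lines and a geopandas version whose dissolve() accepts a dict of aggregate functions.

```python
# Hypothetical reference list of South Asian countries (names as used in Natural Earth).
south_asia = ["Afghanistan", "Bangladesh", "Bhutan", "India",
              "Maldives", "Nepal", "Pakistan", "Sri Lanka"]

south_asia_gdf = world[world["name"].isin(south_asia)]

# by=None dissolves all internal boundaries into a single geometry,
# while the aggregate functions sum population and GDP across the region.
south_asia_dissolved = south_asia_gdf.dissolve(
    by=None,
    aggfunc={"pop_est": "sum", "gdp_md_est": "sum"},
)
```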
Dissolving boundaries between countries within a continent in the world
Using the same procedure as above, I wanted to dissolve the boundaries between countries within a continent and show different continents distinct from each other in a world map based on the number of countries in each continent.
For this purpose, first I added a new column called num_countries in the world geodataframe containing 1 as a value. Then I dissolved the world map using the continent column as a reference.
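That step might look roughly like this, reusing the world geodataframe from above.

```python
# Each row is one country, so counting this column per continent
# gives the number of countries in that continent.
world["num_countries"] = 1

continents_dissolved = world.dissolve(
    by="continent",
    aggfunc={"pop_est": "sum", "gdp_md_est": "sum", "num_countries": "count"},
).reset_index()
```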
I used the aggregate function to sum up the population and GDP of all countries in each continent and to count the number of countries per continent. The resulting geodataframe, continents_dissolved, looks as shown:
We see that Asia has the largest population and GDP of all continents. Similarly, we see that Africa has the most countries (51) followed by Asia (47), Europe (39), North America (18), South America (13), and Oceania (7). Antarctica and Seven seas (open ocean) are also regarded as continents in this dataset.
Finally, I wanted to plot the world map highlighting the number of countries in each continent with the help of a color map. I achieved this using the following code:
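The author's exact code is in the repository linked at the end of the post; a minimal sketch of such a plot with geopandas and matplotlib could look like this.

```python
import matplotlib.pyplot as plt

# Color each dissolved continent by how many countries it contains.
fig, ax = plt.subplots(figsize=(12, 6))
continents_dissolved.plot(column="num_countries", cmap="viridis",
                          legend=True, edgecolor="black", ax=ax)
ax.set_title("Number of countries per continent")
ax.set_axis_off()
plt.show()
```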
The resulting map appears as shown below:
In this post, I described ways to dissolve map boundaries using QGIS and geopandas in Python. Along the way, I also explained the clipping process and the possibility of using aggregate functions while dissolving map boundaries in geopandas. These processes can be very useful for the manipulation, processing, and transformation of geographical maps in the form of vector datasets. The code and the QGIS project file for this post are available in this GitHub repository. Thank you for reading!
See the article here:
5WPR Technology PR Division Named Among Top in the US – CIOReview
New York, NY - O'Dwyer's, a leading public relations industry publication, has announced its annual PR rankings, naming 5WPR's Technology PR Division the 13th largest in the US. With net fees of over $15 million, the agency's technology practice has remained in the top 15 rankings for over 5 years running.
For the last 55 years, O'Dwyer's has been ranking PR agencies based on their fees, verified by reviewing PR firms' income statements.
"5W's technology client partners span the globe and every sector of the space, from adtech and fintech to artificial intelligence and cybersecurity," said Matt Caiola, Co-CEO, 5WPR. "It is a fast-changing industry that requires a high level of skill to navigate. The dedication of our team makes all the difference and ensures results-driven work that makes a noticeable difference in our clients' brand identity."
Notable clients of the practice include home automation company Samsung SmartThings, legal AI company Casetext, multinational payment and transactional services platform Worldline, data-driven marketing solution Zeta Global, leader in AI-driven narrative and risk intelligence Blackbird.AI, trading software Webull, and the number one enterprise experience platform for critical insights and action, Medallia.
In addition to this recognition, 5WPR has also been named a top-two New York City PR agency and a top US agency by O'Dwyer's this year.
See original here:
5WPR Technology PR Division Named Among Top in the US - CIOReview