Category Archives: Data Science
Meta's New Data Analyst Professional Certification Has Dropped! – KDnuggets
A new professional certification has arrived on the Coursera platform: the Meta Data Analyst Professional Certificate.
If you have been thinking about entering the data analytics market, now is a great opportunity. The current median US salary for a data analyst is $82,000+, with more than 90,000 U.S. job openings in the field.
Let's get right into the course.
Link: Meta Data Analyst Professional Certificate
This course is aimed at beginners looking to enter the tech industry as data analysts. You can take it in your own time and at your own pace. The whole certification takes about 5 months to complete if you commit 10 hours a week, but if you can commit more, you can finish faster!
The certification is made up of 5 courses:
In these 5 courses, you will learn:
If you plan to take this certification, here are the features and a few perks that come with it:
In 5 months you could be ready to start your new career, with the support and guidance you need to get there. The demand for data analysts continues to grow as data becomes more valuable.
Happy learning!
Nisha Arya is a data scientist, freelance technical writer, and an editor and community manager for KDnuggets. She is particularly interested in providing data science career advice, tutorials, and theory-based knowledge around data science. Nisha covers a wide range of topics and wishes to explore the different ways artificial intelligence can benefit the longevity of human life. A keen learner, Nisha seeks to broaden her tech knowledge and writing skills, while helping guide others.
Data Science Unicorns, RAG Pipelines, a New Coefficient of Correlation, and Other April Must-Reads – Towards Data Science
Feeling inspired to write your first TDS post? We're always open to contributions from new authors.
Some months, our community appears to be drawn to a very tight cluster of topics: a new model or tool pops up, and everyone's attention zooms in on the latest, buzziest news. Other times, readers seem to be moving in dozens of different directions, diving into a wide spectrum of workflows and themes. Last month definitely belongs to the latter camp, and as we looked at the articles that resonated the most with our audience, we were struck (and impressed!) by their diversity of perspectives and focal points.
We hope you enjoy this selection of some of our most-read, -shared, and -discussed posts from April, which include a couple of this year's most popular articles to date, and several top-notch (and beginner-friendly) explainers.
Every month, we're thrilled to see a fresh group of authors join TDS, each sharing their own unique voice, knowledge, and experience with our community. If you're looking for new writers to explore and follow, just browse the work of our latest additions, including Thomas Reid, Rechitasingh, Anna Zawadzka, Dr. Christoph Mittendorf, Daniel Manrique-Castano, Maxime Wolf, Mia Dwyer, Nadav Har-Tuv, Roger Noble and Martim Chaves, Oliver W. Johnson, Tim Sumner, Jonathan Yahav, Nicolas Lupi, Julian Yip, Nikola Milosevic (Data Warrior), Sara Nóbrega, Anand Majmudar, Wencong Yang, Shahzeb Naveed, Soyoung L, Kate Minogue, Sean Sheng, John Loewen, PhD, Lukasz Szubelak, Pasquale Antonante, Ph.D., Roshan Santhosh, Runzhong Wang, Leonardo Maldonado, Jiaqi Chen, Tobias Schnabel, Jess.Z, Lucas de Lima Nogueira, Merete Lutz, Eric Boernert, John Mayo-Smith, Hadrien Mariaccia, Gretel Tan, Sami Maameri, Ayoub El Outati, Samvardhan Vishnoi, Hans Christian Ekne, David Kyle, Daniel Pazmiño Vernaza, Vu Trinh, Mateus Trentz, Natasha Stewart, Frida Karvouni, Sunila Gollapudi, and Haocheng Bi, among others.
Optimizing enterprise MLOps in the cloud with Domino Data Lab and Amazon Elastic File System | Amazon Web Services – AWS Blog
Domino Data Lab is an AWS Partner Network (APN) partner that provides a central system of record for data science activity across an organization. The Domino solution delivers orchestration for all data science artifacts, including AWS infrastructure, data and services.
As part of the solution, Domino's platform leverages the scale, security, reliability, and cost-effectiveness of AWS cloud computing coupled with Amazon Elastic File System (Amazon EFS). Together they orchestrate all data science artifacts, such as AWS infrastructure, data, and services. This approach lets data science teams benefit from a flexible and collaborative research environment, with automated workflows that track model development dependencies for full reproducibility, along with enterprise-grade governance, risk management, and granular cost controls.
In this post, we interview David Schulman, Director of Partner Marketing at Domino Data Lab, and explore the Domino Data Lab Enterprise AI Platform to consider why centralizing data, AI, and machine learning operations (MLOps) initiatives into a single system of record across teams can help enterprises work faster, deploy results sooner, scale rapidly, and reduce regulatory and operational risk. In 2023, Domino surveyed artificial intelligence (AI) professionals in its REVelate State of Generative AI survey. The respondents included AI professionals leading, developing, and operating AI across Fortune 500 companies. The survey reports that 49% plan to develop generative AI in-house, while 42% plan to fine-tune commercial models. Top limitations centered on security, reliability, cost, and IP protection. Consequently, 69% are worried about data leakage, with both top leadership (82%) and IT (81%) especially concerned.
According to Ventana Research, "Through 2026, nearly all multinational organizations will invest in local data processing and infrastructure and services to mitigate against the risks associated with data transfer." Hybrid cloud environments complicate operationalizing AI/ML at scale by creating silos across data, infrastructure, and tooling. Data science teams are prevented from collaborating by siloed data, processes, and tools. Non-standardized, non-repeatable, ad hoc bespoke workflows result in sprawl across individuals' computers and systems. Data and compute resources are distributed across cloud and on-premises data centers, creating unconnected environments and silos. There are also hidden costs from data scientists spending time on DevOps and infrastructure management tasks, and from underutilized infrastructure due to idle, always-on, and over-provisioned resources.
Over years of partnership, Domino and AWS have worked together to assist organizations, such as Johnson & Johnson (JnJ), in reducing analysis time by 35% for data scientists[1]. This involves integrating Domino's data science platform with essential AWS services, such as Amazon EFS. Amazon EFS provides analytics storage with shared file access for data scientists. Applications, including open-source genomics tools, Shiny, and Domino Data Lab, run on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. Amazon EFS provides access to a fully managed petabyte-scale file system supporting genomics sequence data at 500 TB. More recently, JnJ has further scaled data science across their hybrid and multicloud environment, adopting AI infrastructure strategies straddling on-premises data centers and the cloud to address concerns over cost, security, and regulatory compliance. Lilly also centralizes data science to drive value across the healthcare value chain, as discussed last spring on a panel at NVIDIA's GTC AI Developer conference.
Domino's Enterprise AI Platform, integrated with key AWS services, provides a unified, collaborative, governed, end-to-end MLOps platform. The solution orchestrates the complete ML lifecycle, providing easy access to data, preferred tools, and infrastructure in any environment. By sharing knowledge, automating workflows across teams, and tracking all changes and dependencies, Domino ensures complete reproducibility while fostering collaboration. It also helps maintain peak model performance in production while ensuring enterprise-grade model governance and cost savings. Domino can be deployed into VPCs, or it is available as a SaaS offering on AWS Marketplace. Attributes include:
Domino is Kubernetes-native and can be deployed on Amazon EKS for ease of management across hybrid environments. This enables cloud-native scalability, multi-cloud portability, reduced costs through elastic workloads paired with underlying hardware resources, and simplified administration by integration with existing DevOps stacks.
Domino can run on a Kubernetes cluster provided by Amazon Elastic Kubernetes Service. When running on EKS, the Domino architecture uses AWS resources to fulfill the Domino cluster requirements, as shown in Figure 1.
Figure 1: View Domino Documentation for Domino on Amazon EKS
Seamless collaboration and knowledge sharing are requirements for data science teams. First, Domino Datasets, integrated with Amazon EFS, provide high-performance, versioned, and structured filesystem storage in Domino, so that data scientists can build curated pipelines of data in one project and share them with colleagues for collaboration. Amazon EFS enables the sharing of data pools among multiple instances that were previously isolated from one another. This increases data science team productivity because Domino tracks not only snapshots of the data used to build models, but also all of the underlying code, packages, environments, and supporting artifacts for full reproducibility, providing rich file difference information between revisions. Additionally, customers such as JnJ value the Amazon EFS lifecycle management feature, which enables them to automatically move data from Amazon EFS Standard to Amazon EFS Infrequent Access. By automating the process of moving data to long-term, cost-effective storage, the customer successfully reduced their storage costs.
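For readers who want to apply the same Standard-to-Infrequent-Access transition on their own file systems, here is a minimal sketch of building an EFS lifecycle configuration. The 30-day threshold and the helper function are illustrative assumptions, not details from the article; the payload shape follows the EFS `put_lifecycle_configuration` API.

```python
def build_lifecycle_policies(days_until_ia: int = 30) -> list:
    """Build an EFS lifecycle configuration that moves files not
    accessed for `days_until_ia` days to the Infrequent Access (IA)
    storage class, and back to Standard on their first access."""
    # Common transition windows supported by EFS (newer windows exist too)
    valid = {7, 14, 30, 60, 90}
    if days_until_ia not in valid:
        raise ValueError(f"unsupported transition window: {days_until_ia}")
    return [
        {"TransitionToIA": f"AFTER_{days_until_ia}_DAYS"},
        {"TransitionToPrimaryStorageClass": "AFTER_1_ACCESS"},
    ]

# With AWS credentials configured, this payload would be applied via boto3
# (the file system ID here is hypothetical):
#   boto3.client("efs").put_lifecycle_configuration(
#       FileSystemId="fs-0123456789abcdef0",
#       LifecyclePolicies=build_lifecycle_policies(30),
#   )
```

The second policy is optional; including it brings a file back to Standard storage as soon as it is accessed again, which suits datasets that go cold and then briefly hot.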
With Amazon EFS, storage capacity is elastic, growing and shrinking automatically as you add and remove files to dynamically provide storage capacity to your applications as needed. With elastic capacity, provisioning is automatic, and you're billed only for what you use. Amazon EFS is designed to be highly scalable both in storage capacity and throughput performance, growing to petabyte scale and allowing massively parallel access from compute instances to your data. This makes it the perfect data science platform foundation for organizations such as JnJ to reduce analysis time by 35%.
Why does this matter? With Amazon EFS, data science teams are empowered to:
Second, Domino Data Sources act as a structured mechanism to create and manage connection properties for external sources such as Amazon S3, Amazon Redshift, and a variety of other systems. This reduces DevOps work for data scientists, who get desktop-like data store connectivity without needing any coding.
Flexible model deployment options support diverse business and operational requirements. Models developed in Domino can be exported for deployment in Amazon SageMaker for scalable and low latency hosting, while models from SageMaker and SageMaker Autopilot can be accessed and monitored inside Domino for drift and prediction performance. Models can be deployed to the cloud, in-database (deploy to Snowflake or Databricks for predictive analytics), or to the edge (Domino supports NVIDIA Fleet Command). Models can be deployed for both batch and real-time predictions at scale, while Domino Model Sentry controls model validation, review, and approval processes for an additional governance layer.
Hybrid cloud support is a necessity for enterprise data science teams, and Domino Nexus acts as a single pane of glass for all data science workloads across hybrid and multi-cloud environments.
Figure 2: Domino/AWS architecture
A Domino Nexus deployment consists of a control plane (a Kubernetes cluster hosting Domino's core platform services, deployed on Amazon EKS as in the preceding Figure 2) and one or more data planes: distinct Kubernetes clusters that run a small set of Domino services that can execute workloads. These can be deployed in any cloud region, across multiple clouds, or in on-premises data centers.
Figure 3: Nexus Hybrid Architecture | AWS Cloud Control Plane (US East Region), AWS Cloud Data Plane (EU Central Region)
As shown in Figure 3, users connect to the Domino control plane through a browser, while they connect directly to the data plane where they do their model development work in a Domino Workspace. Elastic Load Balancing (ELB) allows ingress to Domino control plane services from data planes.
This architecture (Figure 4) eliminates the possibility of inadvertently transferring region-locked data. It also allows data scientists to seamlessly burst to the cloud if they run out of on-premises compute capacity.
Figure 4: Nexus hybrid architecture
Domino's Enterprise AI Platform is proven to deliver an average of 722% ROI over three years (average NPV and ROI based on a study of Domino customers using the Domino Business Value Assessment). This is achieved through twice as many models delivered with the same resources in the same amount of time; a 40% reduction in data scientist time wasted waiting for IT support, doing DevOps, or duplicating work; a 40% reduction in IT and cloud infrastructure costs over three years; and reduced risk of revenue loss from regulatory violations or reputational issues.
Want to go deeper into these metrics? Learn more about Domino and cost-effective AI. Read the following case studies from enterprises with Dominos platform deployed on AWS:
Although generative AI gets all of the attention, large language models (LLMs) are in fact just models. And although there are many intricacies in operating generative AI at scale, such as prompt engineering, model fine-tuning, and inference/hosting (we'll save all that for another post), the following best practices of scalable, enterprise AI remain the same:
Flexibility is required by modern enterprises, who need to build, deploy, and operate AI at scale across a variety of complex architectures. In addition, storage plays an important role, and Amazon EFS provides a cost-effective, elastic, and highly performant solution for your ML inferencing workloads. You only pay for each inference that you run and the storage consumed by your model on the file system. Amazon EFS provides petabyte-scale elastic storage so your architecture can automatically scale up and down based on the needs of your workload without requiring manual provisioning or intervention.
Want to learn more about how Domino on AWS can accelerate responsible AI initiatives? Download the Strategies and Practices for Responsible AI TDWI playbook for insight into a proactive approach to identifying and mitigating business, legal, and ethical risks to create trust and deliver tangible business value.
Visit Domino on AWS Marketplace.
[1] AWS Storage blog Johnson & Johnson reduces analysis time by 35% with their data science platform using Amazon EFS
Office of Data Science and Sharing (ODSS) | NICHD – Eunice Kennedy Shriver National Institute of Child Health and … – National Institute of Child…
In collaboration with NICHD, NIH, and external stakeholders, ODSS is building a federated, secure research and specimen data ecosystem that will measurably and rapidly facilitate data and specimen sharing by NICHD-funded researchers and increase access to shared data and specimens for the entire research community.
The NICHD Data Ecosystem includes people, data, processes, and technologies that align with the NICHD Strategic Plan and that support NICHD communities' data science and sharing needs. In addition, ODSS is collecting user stories to describe these needs, within the following contexts:
The NICHD Ecosystem Use Case Library on GitHub tracks user stories, use cases, efforts, and associated documentation to improve capabilities across the ecosystem.
ODSS is currently assessing all data repositories that NICHD researchers use to share data or access shared data for secondary use. This assessment and the communities' user stories will inform NICHD's approach to improving sustainability and interoperability across the ecosystem to best support the institute's needs. The office will also use information from the assessment to update the NICHD Data Repository Finder.
Siebel School of Computing and Data Science FAQ – Illinois Computer Science News
This is a list of frequently asked questions and links to additional resources related to the launch of the new school. If you have a question not addressed here, please email us.
Will the Siebel School of Computing and Data Science be a separate college at the University of Illinois?
No. What is currently the Department of Computer Science will become The Siebel School of Computing and Data Science, and it will be a part of The Grainger College of Engineering.
Will there be separate departments in the school?
No. The Siebel School of Computing and Data Science will not have any departments.
Will there be any changes to the degrees granted by the Siebel School of Computing and Data Science compared to those offered by the Department of Computer Science?
No. The names of undergraduate degrees and graduate degrees offered by the new school will be the same as those offered by the Department of Computer Science. This includes all Computer Science degrees (BS in Computer Science, MCS in Computer Science, MS in Computer Science, MS in Bioinformatics (Computer Science concentration), and PhD in Computer Science).
Will there be any changes to the diplomas for the degrees offered by the Siebel School of Computing and Data Science compared to those offered by the Department of Computer Science?
No. Since the department or school is not listed on diplomas from the University of Illinois Urbana-Champaign, students' diplomas will remain the same.
Will the courses offered by the school have a new rubric (e.g., CS 225)?
No. The course designations will not change. For example, CS 225 will remain CS 225.
Will the Computer Engineering (CE) major be in the school?
No. At The Grainger College of Engineering, the Computer Engineering major resides in the Department of Electrical and Computer Engineering, and this will remain the same.
Computer Engineering majors currently have priority over other majors for certain computer science courses. Will this change?
No. There will be no change; CE majors will continue to have the same registration priorities for computer science courses.
Will the new school contain the Data Science majors?
No. The data science majors will remain in their home colleges.
Will the School of Information Sciences (iSchool) be merged into it?
No. The iSchool will remain a separate school.
I've been admitted to the computer science major. Is my admission offer still valid?
Yes. Admission decisions do not change.
I've been denied admission to the computer science major. Can I appeal and ask to be considered for admission to the new school?
The admissions policies remain the same.
Do current undergraduate and graduate students in the Department of Computer Science need to take any action to be moved to the Siebel School of Computing and Data Science?
No, students do not need to take any action.
Will the transcripted information change?
No, the transcripted information will remain the same.
Will the Mathematics & CS, Statistics & CS, CS + X, or X + Data Science programs change or result in dual degrees?
No, they will not. Each of these programs is a single major that leads to a Bachelor of Science degree in the college that is home to Mathematics, Statistics, or the X discipline, respectively.
Will there be any changes to the online Masters in Computer Science programs (MCS and MCS-DS) with the transition to the Siebel School of Computing and Data Science?
No. The online master's degree programs offered by the Department of Computer Science will continue as part of the Siebel School of Computing and Data Science. This includes both the online MCS and the MCS-DS track of the MCS, which focuses on data science.
What will happen with the Illinois Computing Accelerator for Non-Specialists (iCAN) with the transition from the Department of Computer Science to the Siebel School of Computing and Data Science?
iCAN will continue under the Siebel School of Computing and Data Science as it did under the Department of Computer Science.
Will the Chicago MCS program change with the transition from the Department of Computer Science to the Siebel School of Computing and Data Science?
No. The MCS program offered by the Department of Computer Science in Chicago will continue as part of the Siebel School of Computing and Data Science.
Will the graduate admissions processes change (including application processes) in the Siebel School of Computing and Data Science from current practices in the Department of Computer Science?
No, the admissions process will remain the same.
How can alumni from the Department of Computer Science indicate the transition to the Siebel School of Computing and Data Science in their resume/CV?
Depending on context, alumni may use their best judgment, but one example might be: Siebel School of Computing and Data Science (formerly the Department of Computer Science).
Billionaire Tom Siebel Donates $50M to the University of Illinois for Data Science Research – Observer
Tom Siebel in 2017 in Redwood City, California. Courtesy University of Illinois
Tom Siebel, the billionaire head of artificial intelligence (A.I.) company C3.ai, is funneling $50 million into data science research and education at his alma mater. The tech CEO's donation to the University of Illinois Urbana-Champaign will transform its computer science department into the Siebel School of Computing and Data Science.
The new school, pending approval from the University of Illinois Board of Trustees and Illinois Board of Higher Education, will be part of the university's Grainger College of Engineering. "Its establishment exemplifies the University of Illinois's dedication to pushing the boundaries of knowledge and fostering collaborative solutions to global challenges," said Grainger's dean Rashid Bashir in a statement. "This transformative gift will empower our faculty and students to lead the next generation of technological advancements, further solidifying our position as a world-renowned institution."
The donation will help the university keep up with the rapid pace of research into computing and data science, which have become a key aspect of engineering alongside physical sciences and mathematical sciences, Bashir told Government Technology magazine. Siebel's funds will reportedly support research project opportunities, in addition to funding a new building for the school.
Siebel grew up in Chicago and earned three degrees (a bachelor's in history, an MBA, and a master's in computer science) from the University of Illinois Urbana-Champaign before going on to work as an executive at tech company Oracle. He subsequently created the software firm Siebel Systems and sold it to Oracle in 2006 for $5.8 billion.
He currently has an estimated net worth of $3.7 billion and has since 2009 run C3.ai, which specializes in utilizing enterprise A.I. to aid large-scale companies. It counts the U.S. Air Force, Department of Defense, Shell and Con Edison (ED) among its customers.
"We are thrilled to partner with the University of Illinois to establish the Siebel School of Computing and Data Science," said Siebel in a statement. "By supporting cutting-edge research and fostering innovation, we hope to empower future generations of leaders in technology and society, driving positive change in our world."
This isn't the first time Siebel has used his fortune to give back to the educational institution. He donated $32 million in 1999 to create a computer science center and another $25 million in 2016 to create a design center, in addition to giving $100 million to its science and engineering programs in 2007.
In 2019, Siebel's C3.ai launched a generous education benefit that reimburses eligible employees who pursue master's degrees in computer science with a focus on data science at the University of Illinois Urbana-Champaign. In addition to covering the total degree cost, the company offers employee graduates a 15 percent salary increase, a $25,000 cash bonus, and stock awards.
Environmental Implications of the AI Boom | by Stephanie Kirmer | May, 2024 – Towards Data Science
Photo by ANGELA BENITO on Unsplash
There's a core concept in machine learning that I often tell laypeople about to help clarify the philosophy behind what I do. That concept is the idea that the world changes around every machine learning model, often because of the model, so the world the model is trying to emulate and predict is always in the past, never the present or the future. The model is, in some ways, predicting the future (that's how we often think of it), but in many other ways the model is actually attempting to bring us back to the past.
I like to talk about this because the philosophy around machine learning helps give us real perspective as machine learning practitioners as well as the users and subjects of machine learning. Regular readers will know I often say that "machine learning is us," meaning we produce the data, do the training, and consume and apply the output of models. Models are trying to follow our instructions, using raw materials we have provided to them, and we have immense, nearly complete control over how that happens and what the consequences will be.
Another aspect of this concept that I find useful is the reminder that models are not isolated in the digital world, but in fact are heavily intertwined with the analog, physical world. After all, if your model isn't affecting the world around us, that raises the question of why your model exists in the first place. If we really get down to it, the digital world is only separate from the physical world in a limited, artificial sense: that of how we as users/developers interact with it.
This last point is what I want to talk about today: how does the physical world shape and inform machine learning, and how does ML/AI in turn affect the physical world? In my last article, I promised that I would talk about how the limitations of resources in the physical world intersect with machine learning and AI, and that's where we're going.
This is probably obvious if you think about it for a moment. There's a joke that goes around about how we can defeat the sentient robot overlords by just turning them off, or unplugging the computers. But jokes aside, this has a real kernel of truth. Those of us who work in machine learning and AI, and computing generally, have complete dependence for our industry's existence on natural resources, such as mined metals, electricity, and others. This has some commonalities with a piece I wrote last year about how human labor is required for machine learning to exist, but today we're going to go a different direction and talk about two key areas that we ought to appreciate more as vital to our work: mining/manufacturing and energy, mainly in the form of electricity.
If you go out looking for it, there is an abundance of research and journalism about both of these areas, not only in direct relation to AI, but relating to earlier technological booms such as cryptocurrency, which shares a great deal with AI in terms of its resource usage. I'm going to give a general discussion of each area, with citations for further reading so that you can explore the details and get to the source of the scholarship. It is hard, however, to find research that takes into account the last 18 months' boom in AI, so I expect that some of this research is underestimating the impact of the new technologies in the generative AI space.
What goes into making a GPU chip? We know these chips are instrumental in the development of modern machine learning models, and Nvidia, the largest producer of these chips today, has ridden the crypto boom and AI craze to a place among the most valuable companies in existence. Their stock price went from around $130 a share at the start of 2021 to $877.35 a share in April 2024 as I write this sentence, giving them a reported market capitalization of over $2 trillion. In Q3 of 2023, they sold over 500,000 chips, for over $10 billion. Estimates put their total 2023 sales of H100s at 1.5 million, and 2024 is easily expected to beat that figure.
GPU chips involve a number of different specialty raw materials that are somewhat rare and hard to acquire, including tungsten, palladium, cobalt, and tantalum. Other elements might be easier to acquire but have significant health and safety risks, such as mercury and lead. Mining these elements and compounds has significant environmental impacts, including emissions and environmental damage to the areas where mining takes place. Even the best mining operations change the ecosystem in severe ways. This is in addition to the risk of what are called Conflict Minerals, or minerals that are mined in situations of human exploitation, child labor, or slavery. (Credit where it is due: Nvidia has been very vocal about avoiding use of such minerals, calling out the Democratic Republic of Congo in particular.)
In addition, after the raw materials are mined, all of these materials have to be processed extremely carefully to produce the tiny, highly powerful chips that run complex computations. Workers have to take on significant health risks when working with heavy metals like lead and mercury, as we know from industrial history over the last 150+ years. Nvidia's chips are made largely in factories in Taiwan run by a company called Taiwan Semiconductor Manufacturing Company, or TSMC. Because Nvidia doesn't actually own or run the factories, it is able to bypass criticism about manufacturing conditions or emissions, and data is difficult to come by. The power required to do this manufacturing is also not on Nvidia's books. As an aside: TSMC has reached the maximum of its capacity and is working on increasing it. In parallel, Nvidia is planning to begin working with Intel on manufacturing capacity in the coming year.
After a chip is produced, it can have a lifespan of usefulness that can be significant (3-5 years if maintained well); however, Nvidia is constantly producing new, more powerful, more efficient chips (2 million a year is a lot!), so a chip's lifespan may be limited by obsolescence as well as wear and tear. When a chip is no longer useful, it goes into the pipeline of what is called e-waste. Theoretically, many of the rare metals in a chip ought to have some recycling value, but as you might expect, chip recycling is a very specialized and challenging technological task, and only about 20% of all e-waste gets recycled, including much less complex things like phones and other hardware. The recycling process also requires workers to disassemble equipment, again coming into contact with the heavy metals and other elements that are involved in manufacturing to begin with.
If a chip is not recycled, on the other hand, it is likely dumped in a landfill or incinerated, leaching those heavy metals into the environment via water, air, or both. This happens in developing countries, and often directly affects areas where people reside.
Most research on the carbon footprint of machine learning, and its general environmental impact, has been in relation to power consumption, however. So lets take a look in that direction.
Once we have the hardware necessary to do the work, the elephant in the room with AI is definitely electricity consumption. Training large language models consumes extraordinary amounts of electricity, but serving and deploying LLMs and other advanced machine learning models is also an electricity sinkhole.
In the case of training, one research paper suggests that training GPT-3, with 175 billion parameters, runs around 1,300 megawatt hours (MWh) or 1,300,000 KWh of electricity. Contrast this with GPT-4, which uses 1.76 trillion parameters, and where the estimated power consumption of training was between 51,772,500 and 62,318,750 KWh of electricity. For context, an average American home uses just over 10,000 KWh per year. On the conservative end, then, training GPT-4 once could power almost 5,000 American homes for a year. (This is not considering all the power consumed by preliminary analyses or tests that almost certainly were required to prepare the data and get ready to train.)
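A quick back-of-the-envelope check of those figures (the numbers below are the estimates quoted above; the 10,500 KWh per home is an assumed reading of "just over 10,000"):

```python
# Sanity-checking the training-energy estimates quoted above.
gpt3_kwh = 1_300_000        # ~1,300 MWh estimated for training GPT-3
gpt4_kwh_low = 51_772_500   # conservative estimate for training GPT-4
home_kwh_per_year = 10_500  # "just over 10,000 KWh" per average US home

homes_powered = gpt4_kwh_low / home_kwh_per_year
growth = gpt4_kwh_low / gpt3_kwh

print(f"GPT-4 training could power ~{homes_powered:,.0f} US homes for a year")
print(f"Training energy grew ~{growth:.0f}x from GPT-3 to GPT-4")
```

Both of the article's claims check out: just under 5,000 homes, and roughly a 40x jump.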
Given that power usage for training went up approximately 40x between GPT-3 and GPT-4, we have to be concerned about the future electrical consumption of the next versions of these models, as well as the consumption for training models that generate video, image, or audio content.
Past the training process, which only needs to happen once in the life of a model, there's the rapidly growing electricity consumption of inference tasks, namely the cost of every time you ask ChatGPT a question or try to generate a funny image with an AI tool. This power is absorbed by data centers where the models are running so that they can serve results around the globe. The International Energy Agency predicted that data centers alone could consume 1,000 terawatt-hours (TWh) in 2026, roughly the annual power usage of Japan.
Major players in the AI industry are clearly aware that this kind of growth in electricity consumption is unsustainable. Estimates are that data centers consume between 0.5% and 2% of all global electricity usage, and could potentially account for 25% of US electricity usage by 2030.
Electrical infrastructure in the United States is not in good condition. We are trying to add more renewable power to our grid, of course, but we're deservedly not known as a country that manages our public infrastructure well. Texas residents in particular know the fragility of our electrical systems, but across the US, climate change in the form of increased extreme weather causes power outages at a growing rate.
Whether investments in electricity infrastructure have a chance of meeting the skyrocketing demand wrought by AI tools is still to be seen, and since government action is necessary to get there, it's reasonable to be pessimistic.
In the meantime, even if we do manage to produce electricity at the necessary rates, until renewable and emission-free sources of electricity are scalable, we're adding meaningfully to the carbon emissions output of the globe by using these AI tools. At a rough estimate of 0.86 pounds of carbon emissions per KWh of power, training GPT-4 output over 20,000 metric tons of carbon into the atmosphere. (In contrast, the average American emits 13 metric tons per year.)
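That 20,000-ton figure follows directly from the two numbers in the paragraph (using 2,204.62 pounds per metric ton):

```python
# Reproducing the GPT-4 training-emissions estimate quoted above.
kwh = 51_772_500       # conservative GPT-4 training energy estimate
lbs_per_kwh = 0.86     # rough grid emissions factor used in the text
lbs_per_ton = 2204.62  # pounds per metric ton

tons = kwh * lbs_per_kwh / lbs_per_ton
print(f"~{tons:,.0f} metric tons of carbon")  # just over 20,000
print(f"equivalent to ~{tons / 13:,.0f} average-American-years of emissions")
```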
As you might expect, I'm not out here arguing that we should quit doing machine learning because the work consumes natural resources. I think that workers who make our lives possible deserve significant workplace safety precautions and compensation commensurate with the risk, and I think renewable sources of electricity should be a huge priority as we face down preventable, human-caused climate change.
But I talk about all this because knowing how much our work depends upon the physical world, natural resources, and the earth should make us humbler and make us appreciate what we have. When you conduct training or inference, or use ChatGPT or DALL-E, you are not the endpoint of the process. Your actions have downstream consequences, and it's important to recognize that and make informed decisions accordingly. You might be renting seconds or hours of use of someone else's GPU, but that still uses power and causes wear on that GPU, which will eventually need to be disposed of. Part of being ethical world citizens is thinking about your choices and considering your effect on other people.
In addition, if you are interested in finding out more about the carbon footprint of your own modeling efforts, there's a tool for that: https://www.green-algorithms.org/
Environmental Implications of the AI Boom | by Stephanie Kirmer | May, 2024 - Towards Data Science
Why and When to Use the Generalized Method of Moments – Towards Data Science
Hansen (1982) introduced the generalized method of moments (GMM), making notable contributions to empirical research in finance, particularly in asset pricing. The model was motivated by the need to estimate parameters in economic models while adhering to the theoretical constraints implicit in the model. For example, if the economic model states that two things should be independent, GMM will try to find a solution in which the average of their product is zero. Understanding GMM can therefore offer a powerful alternative for those who need a model in which theoretical conditions are extremely important, but which conventional models cannot satisfy due to the nature of the data.
This estimation technique is widely used in econometrics and statistics to address endogeneity and other issues in regression analysis. The basic concept of the GMM estimator involves minimizing a criterion function by choosing parameters that make the sample moments of the data as close as possible to the population moments. The equation for the Basic GMM Estimator can be expressed as follows:
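The equation itself is not reproduced in this excerpt; in its standard form (a reconstruction, not necessarily the article's own notation), with sample moments averaged over the data and a positive-definite weighting matrix W, the criterion being minimized is:

```latex
\hat{\theta}_{\text{GMM}}
  = \arg\min_{\theta} \; \bar{g}_n(\theta)^{\top} \, W \, \bar{g}_n(\theta),
\qquad
\bar{g}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} g(x_i, \theta)
```

where g(x_i, θ) stacks the moment conditions that the model says should be zero in expectation.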
The GMM estimator aims to find the parameter vector that minimizes this criterion function, thereby ensuring that the sample moments of the data align as closely as possible with the population moments. By optimizing this criterion function, the GMM estimator provides consistent estimates of the parameters in econometric models.
Being consistent means that as the sample size approaches infinity, the estimator converges in probability to the true parameter value (and, under standard regularity conditions, it is also asymptotically normal). This property is crucial for ensuring that the estimator provides reliable estimates as the amount of data increases. Even in the presence of omitted variables, as long as the moment conditions are valid and the instruments are correctly specified, GMM can provide consistent estimates. However, the omission of relevant variables can impact the efficiency and interpretation of the estimated parameters.
To be efficient, GMM weights the moment conditions in the spirit of Generalized Least Squares (GLS): moments with higher variance receive less weight, which addresses heteroscedasticity and autocorrelation. In practice, a first-step estimate is used to compute the covariance of the sample moments, and its inverse then serves as the weighting matrix, minimizing the variance of the estimator and enhancing the precision of the parameter estimates.
However, it is important to recognize that the GMM estimator is subject to a series of assumptions that must be considered during its application, listed below:
Therefore, GMM is a highly flexible estimation technique and can be applied in a variety of situations, being widely used as a parameter estimation technique in econometrics and statistics. It allows for efficient estimation of parameters under different model specifications and data structures. Its main uses are:
The contrast between the Ordinary Least Squares (OLS) method and the Generalized Method of Moments (GMM) highlights their different advantages. OLS is efficient under the classical assumptions of linearity, serving as the best linear unbiased estimator (BLUE). The fundamental assumptions of a linear regression model are: linearity in the relationship between variables, absence of perfect multicollinearity, zero mean error, homoscedasticity (constant variance of errors), non-autocorrelation of errors, and normality of errors. Under these assumptions, OLS is an unbiased, consistent, and efficient estimator. Furthermore, it has relatively low computational complexity.
However, GMM provides more flexibility, which is applicable to a wide range of contexts such as models with measurement errors, endogenous variables, heteroscedasticity, and autocorrelation. It makes no assumptions about the distribution of errors and is applicable to nonlinear models. GMM stands out in cases where we have omitted important variables, multiple moment conditions, nonlinear models, and datasets with heteroscedasticity and autocorrelation.
Conversely, comparing GMM with Maximum Likelihood Estimation (MLE) highlights their different approaches to handling data assumptions. GMM constructs estimators using data and population moment conditions, providing flexibility and adaptability to models with fewer assumptions, which is particularly advantageous when strong assumptions about the data distribution may not hold.
MLE estimates parameters by maximizing the likelihood of the given data, depending on specific assumptions about data distribution. While MLE performs optimally when the assumed distribution closely aligns with the true data-generating process, GMM accommodates various distributions, proving valuable in scenarios where data may not conform to a single specific distribution.
In the hypothetical example demonstrated in Python, we utilize the linearmodels.iv library to estimate a GMM model with the IVGMM function. In this model, consumption serves as the dependent variable, while age and gender (represented as a dummy variable for male) are considered exogenous variables. Additionally, we assume that income is an endogenous variable, while the number of children and education level are instrumental variables.
Instrumental variables in GMM models are used to address endogeneity issues by providing a source of exogenous variation that is correlated with the endogenous regressors but uncorrelated with the error term. The IVGMM function is specifically designed for estimating models in which instrumental variables are used within the framework of GMM.
Therefore, by specifying Consumption as the dependent variable and employing exogenous variables (age and gender) along with instrumental variables (number of children and education) to address endogeneity, this example fits within the GMM context.
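The post's code listing is not included in this excerpt, and it relies on the linearmodels IVGMM API; as a self-contained sketch of the same idea, here is a two-step linear GMM-IV estimator written in plain NumPy, on synthetic data mirroring the article's setup (all variable names and coefficients here are illustrative assumptions, not the article's data):

```python
import numpy as np

# Synthetic data: consumption depends on income (endogenous), age, and
# gender; children and education serve as instruments for income.
rng = np.random.default_rng(0)
n = 5_000
age = rng.uniform(20, 60, n)
male = rng.integers(0, 2, n).astype(float)
children = rng.integers(0, 4, n).astype(float)
education = rng.uniform(8, 20, n)

u = rng.normal(size=n)  # unobserved factor creating endogeneity
income = 1.0 + 0.5 * children + 0.8 * education + u + rng.normal(size=n)
consumption = (2.0 + 0.7 * income + 0.05 * age + 0.3 * male
               - 1.5 * u + rng.normal(size=n))

X = np.column_stack([np.ones(n), income, age, male])             # regressors
Z = np.column_stack([np.ones(n), children, education, age, male])  # instruments

def gmm_iv(y, X, Z):
    """Two-step linear GMM with moment conditions g_i = z_i * (y_i - x_i'b)."""
    # Step 1: weight by (Z'Z/n)^-1, which reproduces 2SLS.
    W = np.linalg.inv(Z.T @ Z / len(y))
    b = np.linalg.solve(X.T @ Z @ W @ Z.T @ X, X.T @ Z @ W @ Z.T @ y)
    # Step 2: re-weight with the inverse covariance of the sample moments.
    Ze = Z * (y - X @ b)[:, None]
    W = np.linalg.inv(Ze.T @ Ze / len(y))
    return np.linalg.solve(X.T @ Z @ W @ Z.T @ X, X.T @ Z @ W @ Z.T @ y)

beta_gmm = gmm_iv(consumption, X, Z)
beta_ols = np.linalg.lstsq(X, consumption, rcond=None)[0]
print("GMM income coefficient:", round(beta_gmm[1], 3))  # close to the true 0.7
print("OLS income coefficient:", round(beta_ols[1], 3))  # biased by endogeneity
```

Because the unobserved factor u enters both income and the consumption error, plain OLS is biased, while the instrumented GMM estimate recovers the true coefficient.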
Dominici named to Time 100 Health list Harvard Gazette – Harvard Gazette
Time magazine has named Francesca Dominici, Clarence James Gamble Professor of Biostatistics, Population and Data Science at the Harvard T.H. Chan School of Public Health and the faculty director of the Harvard Data Science Initiative, to the inaugural 2024 Time 100 Health list in recognition of her groundbreaking contributions to global health.
The Time 100 list, first established in 1999, recognizes the activism, innovation, and achievement of the most influential individuals in the world each year. Time 100 Health is a new annual list of the 100 individuals who have most influenced global health.
"I am honored to be recognized on this new list," said Dominici. "I cannot thank enough all of the students, postdocs, and collaborators who have worked with me tirelessly, and with enormous enthusiasm, to minimize the climate crisis's adverse impacts on health and to use data science to find climate adaptation solutions."
Dominici's research focuses on building AI-ready data platforms and developing causally aware AI approaches to identify ways to prevent death and disease from exposure to environmental contaminants and climate-related stressors. She leads several interdisciplinary groups of scientists exploring questions in environmental health science, climate change, cancer research, and health policy. As faculty director of the Harvard Data Science Initiative (HDSI), she leads several University-wide research programs and activities to advance data science and leverage it to analyze the current breakdown of trust in scientific research. HDSI's programs include a large postdoctoral fellows program that draws outstanding early-career scholars from diverse disciplines as well as the AWS Impact Computing Project, a collaboration with Amazon Web Services that advances innovative solutions to global challenges such as climate change, health inequity, and food insecurity.
Dominici has been named to the Time 100 Health list following her own recent research on using data science to inform U.S. air quality policy, which directly impacted the Feb. 7, 2024, revisions of the National Ambient Air Quality Standards. The revised standards will result in significant public health net benefits that could be worth as much as $46 billion by 2032. In addition to this work, Dominici is currently leading a large group of data scientists in building causally aware AI models trained on claims data from across the entire U.S. healthcare system. Their goal is to identify solutions for climate adaptation and ways of reducing the energy sector's carbon footprint and adverse health impacts in the U.S. and around the world.
Get Underlined Text from Any PDF with Python – Towards Data Science
A step-by-step guide to get underlined text as an array from PDF files.
If you want to see the code for this project, check out my repository: https://github.com/sasha-korovkina/pdfUnderlinedExtractor
PDF data extraction can be a real headache, and it gets even trickier when you're trying to snag underlined text. Believe it or not, there aren't any go-to solutions or libraries that handle this out of the box. But don't worry, I'm here to show you how to tackle this.
Extracting underlined text from PDFs can take a few different paths. You might consider using OCR to detect text components with bottom lines, or delve into PyMuPDF's markup capabilities. However, I've found that OCR tends to falter, suffering from inconsistency and low accuracy. PyMuPDF isn't my favorite either: it demands finicky parameter tuning, which is time-consuming. Plus, one wrong setting and you could lose a bunch of data.
It is important to remember that PDFs are:
But fear not, as we have a strategy to resolve this.
We will use the pdfquery library, the most comprehensive PDF-to-XML converter I have come across.
2. Studying the XML
The XML has a few key components which we are interested in:
LTRect component example:
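The snippet itself is not reproduced in this excerpt; based on how pdfquery (via pdfminer) typically serializes layout objects, an LTRect element looks roughly like the sample below, and its coordinates can be pulled out with the standard library's ElementTree (the attribute values here are made up for illustration, not taken from the original post):

```python
import xml.etree.ElementTree as ET

# Illustrative XML in the shape pdfquery typically emits for a page
# containing one thin rectangle (the underline).
sample = """
<pdfxml>
  <LTPage pageid="1" width="612" height="792">
    <LTRect x0="72.0" y0="690.2" x1="180.5" y1="691.0"
            width="108.5" height="0.8"/>
  </LTPage>
</pdfxml>
"""

root = ET.fromstring(sample)
underline_boxes = [
    tuple(float(rect.get(k)) for k in ("x0", "y0", "x1", "y1"))
    for rect in root.iter("LTRect")
    # Very thin rectangles are the likely underlines; taller ones tend
    # to be table cells or borders.
    if float(rect.get("height")) < 3
]
print(underline_boxes)
```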
Therefore, by converting the whole document into XML format, we can replicate its structure as XML components. Let's do just that!
Now, we will re-create the structure of our document as bounding box coordinates. To do this, we will parse the XML to define the page, component boxes, lines and rectangles, and then draw them all on our canvas in 3 different colors.
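The drawing code is not included in this excerpt; a minimal sketch of the idea with matplotlib might look like the following (the box coordinates are hypothetical stand-ins, and the choice of red for text boxes is my assumption):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Hypothetical boxes in PDF coordinates (x0, y0, x1, y1).
page_box = (0, 0, 612, 792)
text_boxes = [(72, 680, 300, 700), (72, 640, 500, 660)]
rect_boxes = [(72, 678, 300, 679)]  # thin LTRects, i.e. underlines

fig, ax = plt.subplots(figsize=(6.12, 7.92))
for boxes, color in [([page_box], "black"),
                     (text_boxes, "red"),
                     (rect_boxes, "blue")]:
    for x0, y0, x1, y1 in boxes:
        ax.add_patch(Rectangle((x0, y0), x1 - x0, y1 - y0,
                               fill=False, edgecolor=color))
ax.set_xlim(0, 612)
ax.set_ylim(0, 792)
fig.savefig("layout.png")
print(len(ax.patches))  # one rectangle per drawn component
```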
Here is our initial PDF. It was generated in Microsoft Word by exporting a document with some underlines to the PDF file format:
After applying the algorithm above, here is the visual representation we get:
This image represents the structure of our document, where the black box is used to describe all components on the page, and the blue is used to describe the LTRect elements, hence the underlined text.
Now, lets visualize all of the text within the PDF in its respective positions, with the following line of code:
Here is the output:
Note that the text is not exactly where it was in the original document, due to the difference in size and font of the mark-up language in the pdfquery library.
As a result of our XML processing, we will have an array of coordinates of underlined regions; in my case, I have called it underline_text.
Heres the process:
This method of extracting text from PDFs using coordinate rectangles and Tesseract OCR is effective for several reasons:
And this is the code:
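The listing itself is not reproduced in this excerpt; as a hedged sketch of that step (the function names are mine, and it assumes the page has already been rendered to a PIL image, e.g. with pdf2image), the helpers below convert a PDF-space box to pixel coordinates, crop the region with a little headroom for the text above the underline, and hand it to Tesseract via pytesseract:

```python
from PIL import Image

def pdf_box_to_pixels(box, page_height, scale=1.0):
    """Map a PDF bbox (origin bottom-left) to PIL pixels (origin top-left)."""
    x0, y0, x1, y1 = box
    return (int(x0 * scale), int((page_height - y1) * scale),
            int(x1 * scale), int((page_height - y0) * scale))

def ocr_underlined(page_image, boxes, page_height, scale=1.0, pad=12):
    """Crop each underline region (plus headroom for the text above the
    line) and OCR it. Requires the tesseract binary to be installed."""
    import pytesseract
    words = []
    for box in boxes:
        left, top, right, bottom = pdf_box_to_pixels(box, page_height, scale)
        region = page_image.crop((left, max(top - pad, 0), right, bottom))
        words.append(pytesseract.image_to_string(region).strip())
    return words

# Demo of the coordinate conversion and crop on a blank page image:
page = Image.new("RGB", (612, 792), "white")
px = pdf_box_to_pixels((72.0, 690.2, 180.5, 691.0), page_height=792)
strip = page.crop((px[0], max(px[1] - 12, 0), px[2], px[3]))
print(px, strip.size)
```

The y-axis flip is the part that usually trips people up: PDF coordinates grow upward from the bottom-left, while image coordinates grow downward from the top-left.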
Make sure that you have Tesseract installed on your system before running this function. For in-depth instructions, check out the official installation guide here: https://github.com/tesseract-ocr/tessdoc/blob/main/Installation.md or my GitHub repository here: https://github.com/sasha-korovkina/pdfUnderlinedExtractor.
Now, if we take any PDF file, like this example file:
We have some underlined words in this file:
After running the code described above, here is what we get:
After getting this array, you can use these words for further processing!
Enjoy using this script! I'd love to hear about any creative applications you come up with or if you'd like to contribute. Let me know!