Why Data Science Teams Should Be Using Pair Programming – The New Stack

Data science is a practice that requires technical expertise in machine learning and code development. However, it also demands creativity (for instance, connecting dense numbers and data to real user needs) and lean thinking (like prioritizing the experiments and questions to explore next). In light of these needs, and to continuously innovate and create meaningful outcomes, it's essential to adopt processes and techniques that facilitate high levels of energy, drive and communication in data science development.

Pair programming can increase communication, creativity and productivity in data science teams. Pair programming is a collaborative way of working in which two people take turns coding and navigating on the same problem, at the same time, on the same computer connected with two mirrored screens, two mice and two keyboards.

At VMware Tanzu Labs, our data scientists practice pair programming with each other and with our client-side counterparts. Pair programming is more widespread in software engineering than in data science. We see this as a missed opportunity. Let's explore the nuanced benefits of pair programming in the context of data science, delving into three aspects of the data science life cycle and how pair programming can help with each one.

When data scientists pick up a story for development, exploratory data analysis (EDA) is often the first step in which we start writing code. Arguably, among all components of the development cycle that require coding, EDA demands the most creativity from data scientists: The aim is to discover patterns in the data and build hypotheses around how we might be able to use this information to deliver value for the story at hand.

If new data sources need to be explored to deliver the story, we get familiar with them by asking questions about the data and validating what information they are able to provide to us. As part of this process, we scan sample records and iteratively design summary statistics and visualizations for reexamination.

Pairing in this context enables us to immediately discuss and spark a continuous stream of second opinions and tweaks on the statistics and visualizations displayed on the screen; we each build on the energy of our partner. Practicing this level of energetic collaboration in data science goes a long way toward building the creative confidence needed to generate a wider range of hypotheses, and it adds more scrutiny to synthesis when distinguishing between coincidence and correlation.

Based on what we learn about the data from EDA, we next try to summarize a pattern we've observed, which is useful in delivering value for the story at hand. In other words, we build or train a model that concisely and sufficiently represents a useful and valuable pattern observed in the data.

Arguably, this part of the development cycle demands the most science from data scientists as we continuously design, analyze and redesign a series of scientific experiments. We iterate on a cycle of training and validating model prototypes and make a selection as to which one to publish or deploy for consumption.

Pairing is essential to facilitating lean and productive experimentation in model training and validation. With so many options of model forms and algorithms available, balancing simplicity and sufficiency is necessary to shorten development cycles, increase feedback loops and mitigate overall risk in the product team.

As a data scientist, I sometimes need to resist the urge to use a sophisticated, stuffy algorithm when a simpler model fits the bill. I have biases based on prior experience that influence the algorithms explored in model training.

Having my paired data scientist as my data conscience in model training helps me put on the brakes when I'm running a superfluous number of experiments, constructively challenges the choices made in algorithm selection and course-corrects me when I lose focus from training prototypes strictly in support of the current story.

In addition to aspects of pair programming that influence productivity in specific components of the development cycle such as EDA and model training/validation, there are also perhaps more mundane benefits of pairing for data science that affect productivity and reproducibility more generally.

Take the example of pipelining. Much of the code written for data science is sequential by nature. The metrics we discover and design in EDA are derived from raw data that requires sequential coding to clean and process. These same metrics are then used as key pieces of information (a.k.a. features) when we build experiments for model training. In other words, the code written to design these metrics is a dependency for the code written for model training. Within model training itself, we often try different versions of a previously trained model (which we have previously written code to build) by exploring different variations of input parameter values to improve accuracy. The components and dependencies described above can be represented as steps and segments in a logical, sequential pipeline of code.

Pairing in the context of pipelining brings benefits in shared accountability, driven by a sense of shared ownership of the codebase. While all data scientists know and understand the benefits of segmenting and modularizing code, when coding without a pair it is easy to slip into the habit of creating overly lengthy code blocks, losing track of similar code being copied, pasted and modified, and discounting groups of code dependencies that are only obvious to the person coding. These habits create cobwebs in the codebase and increase risks in reproducibility.

Enter your paired data scientist, who can raise a hand when it becomes challenging to follow the code, highlight groups of code to break up into pipeline segments and suggest blocks of repeated similar code to bundle into reusable functions. Note that this works bidirectionally: when practicing pairing, the data scientist who is typing is fully aware of the shared nature of code ownership and is proactively driven to make efforts to write reproducible code. Pairing is thus an enabler for creating and maintaining a reproducible data science codebase.
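As a rough illustration of the kind of refactor a pair often prompts, here is a minimal Python sketch (column names such as signup_date and churned are hypothetical) of a monolithic analysis broken into pipeline segments whose dependencies are explicit:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def clean(raw: pd.DataFrame) -> pd.DataFrame:
        """Segment 1: clean and process the raw records."""
        df = raw.drop_duplicates().dropna(subset=["signup_date"])
        df["signup_date"] = pd.to_datetime(df["signup_date"])
        return df

    def build_features(df: pd.DataFrame) -> pd.DataFrame:
        """Segment 2: metrics discovered in EDA become model features."""
        df = df.copy()
        df["tenure_days"] = (pd.Timestamp.today() - df["signup_date"]).dt.days
        return df

    def train(df: pd.DataFrame) -> LogisticRegression:
        """Segment 3: model training depends on the feature segment above."""
        model = LogisticRegression()
        model.fit(df[["tenure_days"]], df["churned"])
        return model

    # The explicit chain documents the dependencies between segments:
    # model = train(build_features(clean(raw_data)))

The point is not the particular model, but that each segment is small, named and reusable, so either member of the pair can pick up any step.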

If pair programming is new to your data science practice, we hope this post encourages you to explore it with your team. At Tanzu Labs, we have introduced pair programming to many of our client-side data scientists and have observed that the cycles of continuous communication and feedback inherent in pair programming instill a way of working that sparks more creativity in data discovery, facilitates lean experimentation in model training and promotes better reproducibility of the codebase. And let's not forget that we do all of this to deliver outcomes that delight users and drive meaningful business value.

Here are some practical tips to get started with pair programming in data science:

See the original post here:

Why Data Science Teams Should Be Using Pair Programming - The New Stack

Read More..

ChatGPT accelerates chemistry discovery for climate response … – University of California, Berkeley

UC Berkeley experts taught ChatGPT how to quickly create datasets on difficult-to-aggregate research about certain materials that can be used to fight climate change, according to a new paper published in the Journal of the American Chemical Society.

These datasets on the synthesis of the highly porous materials known as metal-organic frameworks (MOFs) will inform predictive models. The models will accelerate chemists' ability to create or optimize MOFs, including ones that alleviate water scarcity and capture air pollution. All chemists, not just coders, can build these databases due to the use of AI-fueled chatbots.

"In a world where you have sparse data, now you can build large datasets," said Omar Yaghi, the Berkeley chemistry professor who invented MOFs and an author of the study. "There are hundreds of thousands of MOFs that have been reported, but nobody has been able to mine that information. Now we can mine it, tabulate it and build large datasets."

This breakthrough by experts at the College of Computing, Data Science, and Society's Bakar Institute of Digital Materials for the Planet (BIDMaP) will lead to efficient and cost-effective MOFs more quickly, an urgent need as the planet warms. It can also be applied to other areas of chemistry. It is one example of how AI can augment and democratize scientific research.

"We show that ChatGPT can be a very helpful assistant," said Zhiling Zheng, lead author of the study and a chemistry Ph.D. student at Berkeley. "Our ultimate goal is to make [research] much easier."

Other authors of the study, "ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis," include the Department of Chemistry's Oufan Zhang and the Department of Electrical Engineering and Computer Sciences' Christian Borgs and Jennifer Chayes. All are affiliated with BIDMaP, except Zhang.

Certain authors are also affiliated with the Kavli Energy Nanoscience Institute, the Department of Mathematics, the Department of Statistics, the School of Information and KACST-UC Berkeley Center of Excellence for Nanomaterials for Clean Energy Applications.

The team guided ChatGPT to quickly conduct a literature review. They curated 228 relevant papers. Then they enabled ChatGPT to process the relevant sections in those papers and to extract, clean and organize that data.

To help them teach ChatGPT to generate accurate and relevant information, they modified an approach called prompt engineering into "ChemPrompt Engineering." They developed prompts that avoided asking ChatGPT for made-up or misleading content; laid out detailed directions that explained to the chatbot the context and format for the response; and provided the large language model with a template or instructions for extracting data.
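The paper's actual prompts are not reproduced here, but a minimal Python sketch of those three ideas might look like the following (the prompt wording, field list and function name are invented for illustration):

    # Hypothetical illustration of the prompt-engineering ideas described above:
    # grounding instructions, explicit context/format, and an extraction template.
    EXTRACTION_TEMPLATE = (
        "Return a table with the columns: metal source, organic linker, "
        "solvent, temperature (C), reaction time (h). "
        "If a value is not stated in the text, write 'N/A'."
    )

    def build_chem_prompt(paper_excerpt: str) -> list[dict]:
        return [
            {"role": "system", "content": (
                "You extract MOF synthesis conditions from chemistry papers. "
                "Only report values explicitly stated in the text; never guess."  # discourages made-up content
            )},
            {"role": "user", "content": f"{EXTRACTION_TEMPLATE}\n\nText:\n{paper_excerpt}"},
        ]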

The chatbot's literature review and the experts' approach were successful. "ChatGPT finished in a fraction of an hour what would have taken a student years to complete," said Borgs, BIDMaP's director. It mined the synthetic conditions of MOFs with 95% accuracy, Yaghi said.

"AI has transformed many other sectors of our society," said Omar Yaghi, BIDMaP's co-director and chief scientist. "Why not transform science?"

"One big area of how you do AI for science is probing literature more effectively. This is really a substantial jump in doing natural language processing in chemistry," said Chayes, dean of the College of Computing, Data Science, and Society. "And to use it, you can just be a chemist, not a computer scientist."

"This development will speed up MOF-related science work, including those efforts aimed at combating climate change," said Borgs. "With natural disasters becoming more severe and frequent, we need that saved time," he said.

Yaghi noted that using AI in this way is still new. Like any new tool, experts will need time to identify its shortcomings and address them. But it's worth investing the effort, he said.

"If we don't use it, then we can't make it better. If we can't make it better, then we will have missed a whole area that society is already using," Yaghi said. "AI has transformed many other sectors of our society: commerce, banking, travel. Why not transform science?"

Originally posted here:

ChatGPT accelerates chemistry discovery for climate response ... - University of California, Berkeley

Read More..

Not Blowing Smoke: Howard University Researchers Highlight Earth … – The Dig

Since 2021, Amy Y. Quarkume, PhD, has investigated the impacts of environmental data bias on eight Black, Brown, and Indigenous communities across the United States.

Quarkume is an Africana Studies professor and the graduate director of Howard University's inaugural Center for Applied Data Science and Analytics program.

Through in-depth interviews with community members, modeling, and mapping, her team of college, high school, and middle school researchers have already identified significant disparities in environmental data representation.

"What happens when your local news station, state Department of Environmental Quality or the federal Environmental Protection Agency can't disclose what is in the colored skyline and funny odor you smell in the morning, not because they don't want to give you the information, but because they don't know how to give you data that is specific to your community and situation?" Quarkume said.

In 2020, the National Oceanic and Atmospheric Administration (NOAA) announced its strategy to dramatically expand the application of artificial intelligence (AI) in every NOAA mission area. However, as algorithms provided incredibly powerful solutions, they simultaneously disenfranchised marginalized populations.

Quarkume and her team's findings highlight various challenges: inadequate environmental data collection sites, uneven dissemination of environmental information, delays in installing data collection instruments, and a lack of inclusive community voices on environmental concerns. The team argues that implementing AI in this domain will dramatically expand historical and chronic problems.

The multi-year What's Up with All the Bias project, funded by the National Center for Atmospheric Research's Innovator Program, connects questions of climate change, race, AI, culture, and environmental justice in hopes of emphasizing the true lived realities of communities of color in data. Black and Hispanic communities are exposed to more air pollution than their white counterparts and left to deal with the effects of environmental racism as "the new Jim Crow." The project's intersectional approach skillfully magnifies the negative effects of climate change.

Community organizer and lifelong D.C. resident Sebrena Rhodes thinks about air quality often.

The environmental activist has been outspoken about inequalities in air pollution, urban heat, and other environmental justice issues in the Ivy City and Brentwood neighborhoods for years, even prior to a nationwide increase in air quality app usage. In the wake of this summer's ongoing Canadian wildfires, Rhodes is especially vigilant.

"Because of the wildfires in Canada, our poor air quality was further exacerbated," said Rhodes. "Our air quality, per the Purple Air monitor placed at the Ivy City Clubhouse, went up to 403, which was one of Ivy City's worst AQ days ever!"

Purple Air monitors provide community stakeholders with hyperlocal air quality readings that can help them shape their day-to-day experiences.

"We check the air quality in the morning, during the lunch hour, and around 2 p.m. Purple Air gives us results of our air quality in real time, every 10 minutes. The data [updates] throughout the day," Rhodes said.

Studies consistently reveal that populations of color bear an unequal burden when it comes to exposure to air pollution. This inequality is evident in fence-line communities, where African Americans are 75% more likely than their white counterparts to reside near commercial facilities that generate noise, odor, traffic, or emissions.

Further, asthma rates are significantly higher among people of color compared to white communities, with Black Americans being nearly one and a half times more likely to suffer from asthma and three times more likely to die from the condition.

Dr. Quarkume, the principal investigator of What's Up with All the Bias, says matters of air quality, heat, racism, policing, and housing often go hand in hand, as in her hometown of the Bronx.

"Whether it's Asthma Alley in the Bronx or Cancer Alley in Louisiana, some communities have been dealing with these issues for decades," said Quarkume. "They still deserve to know what is in the air. They deserve quality data to make their own decisions and push for change."

Curtis Bay, a working-class Baltimore neighborhood, currently faces disproportionately high amounts of dangerous air pollution and has a long history of industrial accidents and toxic exposures. Activists, residents, and community members alike have fought for these local air quality issues to be addressed.

Accessing quality data has also presented an additional hurdle for frontline communities. In Miamis Little Haiti neighborhood, there are sizable differences between the reported temperature from the National Weather Service and local temperature readings.

These stories have motivated Quarkume and her team to deploy additional air quality monitors, heat sensors, and water quality monitors to communities during the next phase of their project. By supporting local organizations already invested in their communities, she hopes to support community-centered data, data openness, community-centered research, and data equity, principles of the CORE Futures Lab, which she also leads.

"Imagine a world where there is clean air for all. In order to make that happen, we would need to collect enough data on some of our most at-risk communities to begin to model such a reality. The data world has yet to substantially invest in such projects ... Progress is in our ability to translate and empower communities to own and imagine those data points and future for themselves," said Quarkume.

Jessica Moulite and Mikah Jones are PhD research assistants and members of the NCAR Early Career Faculty Innovator Program.

Originally posted here:

Not Blowing Smoke: Howard University Researchers Highlight Earth ... - The Dig

Read More..

Research Associate (Data Science – HRPC) job with NATIONAL … – Times Higher Education

Main Duties and Responsibilities

The main areas of responsibility of the role will include, but are not limited to, the following:

Job Description

The National University of Singapore (NUS) invites applications for the role of Research Associate with the Heat Resilience & Performance Centre (HRPC), Yong Loo Lin School of Medicine. The HRPC is a first-of-its-kind research centre, established at NUS to spearhead and conduct research and development that addresses the future challenges arising from rising ambient heat. Appointments will be made on a two-year contract basis, with the possibility of extension.

We are looking for a Research Associate who is excited to be involved in research aimed at developing strategies and solutions that leverage technology and data science to allow individuals to continue to live, work and thrive in spite of a warming world. This role will see you working as part of a multi-disciplinary team to develop heat health prediction models to be deployed with wearable systems.

Qualifications

A minimum qualification of a Master's degree in a quantitative discipline (e.g., statistics, mathematics, data science, computer science, computational biology, engineering). Proficiency in statistical software and programming languages, and familiarity with relevant libraries.

In addition, the role will require the following job-relevant experience or attributes:

Additional Skills

Application

The role is based in Singapore.

Prospective candidates can contact Ms Lydia Law at lydialaw@nus.edu.sg.

Remuneration will be commensurate with the candidate's qualifications and experience.

Only shortlisted candidates will be notified.

More Information

Location: Kent Ridge Campus
Organization: Yong Loo Lin School of Medicine
Department: Dean's Office (Medicine)
Employee Referral Eligible: No
Job requisition ID: 20100

Originally posted here:

Research Associate (Data Science - HRPC) job with NATIONAL ... - Times Higher Education

Read More..

The Importance of Data Cleaning in Data Science – KDnuggets

In data science, the accuracy of predictive models is vitally important to ensure any costly errors are avoided and that each aspect is working to its optimal level. Once the data has been selected and formatted, the data needs to be cleaned, a crucial stage of the model development process.

In this article, we will provide an overview of the importance of data cleaning in data science, including what it is, the benefits, the data cleaning process, and the commonly used tools.

In data science, data cleaning is the process of identifying incorrect data and fixing the errors so the final dataset is ready to be used. Errors could include duplicate fields, incorrect formatting, incomplete fields, irrelevant or inaccurate data, and corrupted data.

In a data science project, the cleaning stage comes before validation in the data pipeline. In the pipeline, each stage ingests input and creates output, improving the data each step of the way. The benefit of the data pipeline is that each step has a specific purpose and is self-contained, meaning the data is thoroughly checked.

Data seldom arrives in a readily usable form; in fact, it can be confidently stated that data is never flawless. When collected from diverse sources and real-world environments, data is bound to contain numerous errors and adopt different formats. Hence the significance of data cleaning: to render the data error-free, pertinent, and easily assimilated by models.

When dealing with extensive datasets from multiple sources, errors can occur, including duplication or misclassification. These mistakes greatly affect algorithm accuracy. Notably, data cleaning and organization can consume up to 80% of a data scientist's time, highlighting its critical role in the data pipeline.

Below are three examples of how data cleaning can fix errors within datasets.

Data Formatting

Data formatting involves transforming data into a specific format or modifying the structure of a dataset. Ensuring consistency and a well-structured dataset is crucial to avoid errors during data analysis. Therefore, employing various techniques during the cleaning process is necessary to guarantee accurate data formatting. This may encompass converting categorical data to numerical values and consolidating multiple data sources into a unified dataset.
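As a hedged pandas sketch of both tasks (the column names and values are hypothetical), consolidation and categorical-to-numerical conversion can each be a single call:

    import pandas as pd

    # Two hypothetical sources to consolidate
    sales_eu = pd.DataFrame({"amount": [10.0, 12.5], "payment_method": ["card", "cash"]})
    sales_us = pd.DataFrame({"amount": [8.0], "payment_method": ["card"]})

    # Consolidate multiple sources into one unified dataset
    sales = pd.concat([sales_eu, sales_us], ignore_index=True)

    # Convert categorical data to numerical values (one-hot encoding)
    sales = pd.get_dummies(sales, columns=["payment_method"])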

Empty/Missing Values

Data cleaning techniques play a crucial role in resolving data issues such as missing or empty values. These techniques involve estimating and filling in gaps in the dataset using relevant information.

For instance, consider the location field. If the field is empty, scientists can populate it with the average location data from the dataset or a similar one. Although not flawless, having the most probable location is preferable to having no location information at all. This approach ensures improved data quality and enhances the overall reliability of the dataset.
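A minimal pandas sketch of that idea, assuming a small hypothetical dataset, fills a categorical gap with the most frequent value and a numeric gap with the column mean:

    import pandas as pd

    df = pd.DataFrame({"city": ["Lisbon", None, "Porto", "Lisbon"],
                       "temperature": [21.0, 19.5, None, 22.0]})

    # Fill a missing categorical field with the most frequent value...
    df["city"] = df["city"].fillna(df["city"].mode()[0])
    # ...and a missing numeric field with the column mean
    df["temperature"] = df["temperature"].fillna(df["temperature"].mean())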

Identifying Outliers

Within a dataset, certain data points may lack any substantive connection to others (e.g., in terms of value or behavior). Consequently, during data analysis, these outliers possess the ability to significantly distort results, leading to misguided predictions and flawed decision-making. However, by implementing various data cleaning techniques, it is possible to identify and eliminate these outliers, ultimately ensuring the integrity and relevance of the dataset.
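One common way to do this, shown here as a sketch with made-up numbers, is an interquartile-range filter:

    import pandas as pd

    df = pd.DataFrame({"price": [10, 12, 11, 13, 950]})  # 950 is an obvious outlier

    # Keep only values within 1.5x the interquartile range
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df_clean = df[mask]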

Data cleaning provides a range of benefits that have a significant impact on the accuracy, relevance, usability, and analysis of data.

The data cleaning stage of the data pipeline is made up of eight common steps:

Large datasets that utilize multiple data sources are highly likely to have errors, including duplicates, particularly when new entries haven't undergone quality checks. Duplicate data is redundant and consumes unnecessary storage space, necessitating data cleansing to enhance efficiency. Common instances of duplicate data comprise repetitive email addresses and phone numbers.
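In pandas, for example, deduplication is typically a one-liner (the columns here are hypothetical):

    import pandas as pd

    contacts = pd.DataFrame({
        "email": ["a@example.com", "a@example.com", "b@example.com"],
        "phone": ["111", "111", "222"],
    })
    # Drop fully repeated records, keeping the first occurrence
    contacts = contacts.drop_duplicates(subset=["email", "phone"], keep="first")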

To optimize a dataset, it is crucial to remove irrelevant data fields. This will result in faster model processing and enable a more focused approach toward achieving specific goals. During the data cleaning stage, any data that does not align with the scope of the project will be eliminated, retaining only the necessary information required to fulfill the task.

Standardizing text in datasets is crucial for ensuring consistency and facilitating easy analysis. Correcting capitalization is especially important, as it prevents the creation of false categories that could result in messy and confusing data.

When manipulating CSV data in Python, analysts often rely on Pandas, the go-to data analysis library. However, there are instances where Pandas falls short in processing data types effectively. To guarantee accurate data conversion, analysts employ cleaning techniques. This ensures that the correct data is easily identifiable when applied to real-life projects.
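A small sketch of the kind of conversion meant here, with hypothetical columns that arrive from a CSV as plain strings:

    import pandas as pd

    df = pd.DataFrame({"order_date": ["2023-07-01", "2023-07-02"],
                       "quantity": ["3", "5"]})

    # Convert string columns to their intended types explicitly
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["quantity"] = pd.to_numeric(df["quantity"])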

An outlier is a data point that lacks relevance to other points, deviating significantly from the overall context of the dataset. While outliers can occasionally offer intriguing insights, they are typically regarded as errors that should be removed.

Ensuring the effectiveness of a model is crucial, and rectifying errors before the data analysis stage is paramount. Such errors often result from manual data entry without adequate checking procedures. Examples include phone numbers with incorrect digits, email addresses without an "@" symbol, or unpunctuated user feedback.
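A hedged sketch of such sanity checks in pandas (the column names, the 9-digit phone rule and the sample values are assumptions for illustration):

    import pandas as pd

    users = pd.DataFrame({"email": ["ana@example.com", "bruno.example.com"],
                          "phone": ["912345678", "91234"]})

    # Flag records that fail simple checks so they can be reviewed or corrected
    bad_email = ~users["email"].str.contains("@", na=False)
    bad_phone = users["phone"].str.len() != 9   # assumes 9-digit local numbers
    to_review = users[bad_email | bad_phone]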

Datasets can be gathered from various sources written in different languages. However, when using such data for machine translation, evaluation tools typically rely on monolingual Natural Language Processing (NLP) models, which can only handle one language at a time. Thankfully, during the data cleaning phase, AI tools can come to the rescue by converting all the data into a unified language. This ensures greater coherence and compatibility throughout the translation process.

One of the last steps in data cleaning involves addressing missing values. This can be achieved by either removing records that have missing values or employing statistical techniques to fill in the gaps. A comprehensive understanding of the dataset is crucial in making these decisions.

The importance of data cleaning in data science should never be underestimated, as it can significantly impact the accuracy and overall success of a data model. Without thorough data cleaning, the data analysis stage is likely to output flawed results and incorrect predictions.

Common errors that need to be rectified during the data cleaning stage are duplicate data, missing values, irrelevant data, outliers, and converting multiple data types or languages into a single form.

Nahla Davies is a software developer and tech writer. Before devoting her work full time to technical writing, she managed, among other intriguing things, to serve as a lead programmer at an Inc. 5,000 experiential branding organization whose clients include Samsung, Time Warner, Netflix, and Sony.

Read the original here:

The Importance of Data Cleaning in Data Science - KDnuggets

Read More..

Data Science More Democratized and Dynamic: Gartner – CDOTrends

According to analyst group Gartner, Cloud Data Ecosystems and Edge AI are two of the top trends impacting the future of data science and machine learning.

Speaking at the Gartner Data & Analytics Summit in Sydney in August 2023, Peter Krensky, director analyst at Gartner, said: "As machine learning adoption continues to grow rapidly across industries, data science and machine learning, or DSML, is evolving from just focusing on predictive models toward a more democratized, dynamic and data-centric discipline.

"This is now also fueled by the fervor around generative AI. While potential risks are emerging, so too are the many new capabilities and use cases for data scientists and their organizations."

In addition to Cloud Data Ecosystems and Edge AI, Gartner cited three other key trends: Responsible AI, Data-Centric AI, and Accelerated AI Investment.

The Summit heard that Data Ecosystems are moving from self-contained software or blended deployments to full cloud-native solutions. By 2024, Gartner expects 50% of new system deployments in the cloud will be based on a cohesive cloud data ecosystem rather than manually integrated point solutions.

Demand for Edge AI is growing to enable data processing at the point of creation at the edge, helping organizations gain real-time insights, detect new patterns and meet stringent data privacy requirements. Edge AI also helps organizations improve AI development, orchestration, integration and deployment.

Gartner predicts that more than 55% of all data analysis by deep neural networks will occur at the point of capture in an edge system by 2025, up from less than 10% in 2021. Organizations should identify the applications, AI training and inferencing required to move to edge environments near IoT endpoints.

On Responsible AI, Gartner said this trend made AI a positive force rather than a threat to society and itself. Gartner predicts the concentration of pre-trained AI models among 1% of AI vendors by 2025 will make responsible AI a societal concern.

Data-centric AI represents a shift from a model- and code-centric approach to a more data-focused one in order to build better AI systems.

The use of generative AI to create synthetic data is one area that is rapidly growing, relieving the burden of obtaining real-world data so machine learning models can be trained effectively. By 2024, Gartner predicts 60% of data for AI will be synthetic to simulate reality, future scenarios, and de-risk AI, up from 1% in 2021.

Gartner also forecasts that investment in AI will continue to accelerate by organizations implementing solutions and industries looking to grow through AI technologies and AI-based businesses.

By the end of 2026, Gartner predicts that more than USD 10 billion will have been invested in AI startups that rely on foundation models, which are large AI models trained on vast amounts of data.

Image credit: iStockphoto/NicoElNino

View post:

Data Science More Democratized and Dynamic: Gartner - CDOTrends

Read More..

No 10 scientist who helped during Covid pandemic dies in mountain … – The Independent

A leading scientist who helped guide Britain through the Covid-19 pandemic has been killed in a cycling accident near Lake Garda in Italy.

Susannah Boddie, 27, was a lead health data scientist at No 10 Downing Street during the pandemic, having joined their data science team in February 2021.

She suffered fatal injuries while cycling on a wooded path on the Brescia side of the lake at around 10am on Saturday.

Local reports said that Ms Boddie, from Henley-on-Thames in Oxfordshire, had been travelling down a steep downhill trail when she was thrown from the bike, with her partner calling paramedics.

Despite their best efforts, she was pronounced dead at the scene.

In a statement, her heartbroken family said: "Susannah lived life to the full and had achieved so much in her short life. She crammed more into her life than you would have thought possible.

"She was the loveliest, kindest person who always inspired and cared for others and was adored by all her many friends. She will leave the biggest hole in our family and that of Rob, her much-loved partner.

"She was the most wonderful daughter, sister, granddaughter and friend you could ever wish for, and her memory will continue to inspire us in all we do."

A Downing Street spokeswoman added: "Susannah was an incredible scientist, an inspiring sportswoman, a loved and admired colleague and friend to those at No 10 and many others within the civil service.

"Our thoughts are with her family at this difficult time."

Her LinkedIn account shows that she graduated from Cambridge University in 2018 with a degree in pharmacology, before she went on to study for a master's degree in systems biology.

Ms Boddie and her partner are said to have recently finished a tour of the Dolomites and were due to fly home this week from nearby Verona.

A source told the Italian newspaper Il Giorno: "It's a very steep trail and although the woman was wearing a helmet she was thrown quite violently and there was nothing that could be done.

"It's not a tarmacked road, it's a gravel track, so it can be a bit tricky getting down there."

A spokesperson from the Italian police earlier said: "I can confirm that a 27-year-old British woman has died after an accident while cycling near Lake Garda. The circumstances are still being investigated and officers are preparing a report.

"The woman's partner raised the alarm and he was taken to hospital but was not injured, although very shocked."

Read more:

No 10 scientist who helped during Covid pandemic dies in mountain ... - The Independent

Read More..

A Guide to Using ggmap in R – Built In

In R, ggmap is a package that allows users to retrieve and visualize spatial data from Google Maps, Stamen Maps or other similar map services. With it, you can create maps that display your data in a meaningful way, providing a powerful tool for data visualization and exploration.

Adding spatial and map capabilities with ggmap can be a great way to enhance your data science or analytics projects. Whether you want to showcase some examples on a map or because you have some geographical features to build algorithms, having the ability to combine data and maps is a great asset for any developer.

ggmap is an R package that allows users to access visual data from maps like Google Maps or Stamen Maps and display it. It's a useful tool for working with spatial data and creating maps for data visualization and exploration.

In this post, I'll be taking you on a journey through the world of ggmap, exploring its features and capabilities and showing you how it can transform the way you work with spatial data. Whether you're a data scientist, a geographic information systems professional or simply someone with an interest in maps and spatial analysis, this post should help you grasp the basic concepts of the ggmap package using the R programming language.

Let's get started.

To use ggmap, you first need to install the package in R. This can be done by running the following command:
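A likely form of that command, assuming the standard CRAN workflow:

    # Install ggmap from CRAN (only needed once)
    install.packages("ggmap")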

Once the package is installed, you can load it into your R session by running the usual library command:
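Presumably the usual load call:

    library(ggmap)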

The first thing we need to do is to create a map. We can do that using get_map(), a function to retrieve maps using R code.

More on R: Grouping Data With R

This function takes a number of arguments that allow you to specify the location, the type of map (street, satellite, terrain, etc.) and the source of the map. For example, the following code retrieves a street map of Lisbon, using Stamen as a source:
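A sketch of that call (the map style and zoom level are assumptions; the original snippet may differ):

    lisbon_map <- get_map(location = "Lisbon", source = "stamen", maptype = "toner", zoom = 12)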

In ggmap, you can also use Google Maps as your source. To do that, we will need to set up a Google Maps API key, which we will cover later in the article.

When you run the code above, you will notice something odd in the output.

This happens because the location argument uses Google Maps to translate the location into tiles. To use get_map without relying on Google Maps, we'll need to rely on the osmdata library:

Now, we can use the getbb (get bounding box) function and feed its result to the first argument of get_map:
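A sketch of how those two pieces fit together (arguments other than the bounding box are assumptions):

    library(osmdata)

    # Bounding box for Lisbon from OpenStreetMap, passed straight to get_map()
    lisbon_map <- get_map(getbb("Lisbon"), source = "stamen", maptype = "toner", zoom = 12)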

With the code above, we've just downloaded our map into an R variable. By using this variable, we'll be able to plot our downloaded map using ggplot-like features.

Once we have retrieved the map, we can use the ggmap() function to view it. This function takes the map object that we created with get_map() and plots it in a 2D format:
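With the map object created above (named lisbon_map in these sketches):

    ggmap(lisbon_map)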

On the x-axis, we have the longitude of our map, while on the y-axis we can see the latitude. We can also ask for other types of maps by providing a maptype argument to get_map:
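For instance, switching to Stamen's terrain style (again a sketch; the style used in the original may differ):

    lisbon_terrain <- get_map(getbb("Lisbon"), source = "stamen", maptype = "terrain", zoom = 12)
    ggmap(lisbon_terrain)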

Also, you can pass coordinates to the function as arguments. However, in order to use this function, you will need to set up your Google Maps API.

Unlocking Google Maps as a source for the ggmap package will give you access to a range of useful features. With the API set up, you will be able to use the get_map function to retrieve map images based on specified coordinates, as well as unlocking new types and sizes of the map image.

Using ggmap without Google Maps will prevent you from using a bunch of different features that are very important, such as:

So let's continue to use ggmap, but this time using Google services by providing a Google Maps API key. To use it, you need to have a billing address active on Google Cloud Platform, so proceed at your own risk. I also recommend you set up billing alerts, in case Google Maps API pricing changes in the future.

To register your API key in R, you can use the register_google function:
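Roughly, with a placeholder standing in for your own key:

    # Replace with your own key and keep it out of version control
    register_google(key = "YOUR_GOOGLE_MAPS_API_KEY")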

Now, you can ask for several cool things, for example, satellite maps:
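For example (a sketch; the zoom level is arbitrary):

    lisbon_sat <- get_map(location = "Lisbon", source = "google", maptype = "satellite", zoom = 12)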

Visualizing our new map:
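Again with ggmap():

    ggmap(lisbon_sat)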

We can also tweak the zoom parameter for extra detail on our satellite:

Another familiar map that we can access with google services is the famous roadmap:

Now, we can also provide coordinates to the location argument, instead of a place name:
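Something along these lines, using approximate Madrid coordinates (longitude first, then latitude):

    madrid_sat <- get_map(location = c(-3.70, 40.42), source = "google", maptype = "satellite", zoom = 12)
    ggmap(madrid_sat)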

The madrid_sat map is a map centered on Madrid. We gave the Madrid coordinates to get_map by passing a vector with longitude and latitude to the location argument.

So far, we've seen the great map visualization features of ggmap. But, of course, these maps should be able to integrate with our R data. Next, we'll cover the most interesting part of this post: mixing R data with our ggmap plots.

You can also use ggmap to overlay your own data on the map. First, we will create a sample data frame, and then use the geom_point() function from the ggplot2 package to add those coordinates to the map.

Let's create a data frame with two famous locations in Portugal, Torre de Belém and Terreiro do Paço, and then use geom_point() to add those locations to the map:
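A likely form of that data frame, with approximate coordinates for the two landmarks:

    lisbon_locations <- data.frame(
      name = c("Torre de Belém", "Terreiro do Paço"),
      lon  = c(-9.2160, -9.1365),
      lat  = c(38.6916, 38.7075)
    )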

We can now overlay our lisbon_locations on top of ggmap:
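Using ggplot2 layering on top of the base map (a sketch consistent with the objects defined above):

    ggmap(lisbon_map) +
      geom_point(data = lisbon_locations, aes(x = lon, y = lat), color = "red", size = 3)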

Also, we can rely on geom_segment to connect the two dots:
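A sketch of that extra layer, reusing the objects above:

    ggmap(lisbon_map) +
      geom_point(data = lisbon_locations, aes(x = lon, y = lat), color = "red", size = 3) +
      geom_segment(
        data = data.frame(x    = lisbon_locations$lon[1], y    = lisbon_locations$lat[1],
                          xend = lisbon_locations$lon[2], yend = lisbon_locations$lat[2]),
        aes(x = x, y = y, xend = xend, yend = yend), color = "red"
      )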

With just a few lines of code, we can easily retrieve and visualize spatial data from Google Maps and add our coordinates to it, providing a valuable tool for data exploration and visualization.

Finally, we can also work with shapefiles and other complex coordinate data in our plot. For example, I'll add the Lisbon Cycling Roads to the map above by reading a shapefile and plotting it using geom_polygon:

Although not perfect, the map above is a pretty detailed visualization of Lisbon's cycling areas, with just a few errors. We've built this map by:

More on R: The Ultimate Guide to Logical Operators in R

The ggmap package in R provides a useful tool for working with spatial data and creating maps for data visualization and exploration. It allows users to retrieve and visualize maps from various sources, such as Google Maps, Stamen Maps, and others, and provides options for customizing the map type, location and size.

Additionally, setting up a Google Maps API can unlock additional features, such as the ability to transform coordinates into map queries and access different types of maps. Overall, incorporating spatial and map capabilities into data science projects can greatly enhance your data storytelling skills and provide valuable insights.

Read the rest here:

A Guide to Using ggmap in R - Built In

Read More..

Grand challenges in bioinformatics education and training – Nature.com

Asif M. Khan

Present address: College of Computing and Information Technology, University of Doha for Science and Technology, Doha, Qatar

These authors contributed equally: Esra Büşra Işık, Michelle D. Brazas, Russell Schwartz.

Deceased: Christian Schönbach.

Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Istanbul, Turkey

Esra Büşra Işık & Asif M. Khan

APBioNET.org, Singapore, Singapore

Esra Büşra Işık, Harpreet Singh, Hilyatuz Zahroh, Maurice Ling & Asif M. Khan

Ontario Institute for Cancer Research, Toronto, Ontario, Canada

Michelle D. Brazas

Bioinformatics.ca, Toronto, Ontario, Canada

Michelle D. Brazas

Carnegie Mellon University, Pittsburgh, PA, USA

Russell Schwartz

School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales, Australia

Bruno Gaeta

Swiss Institute of Bioinformatics, Lausanne, Switzerland

Patricia M. Palagi

Dutch Techcentre for Life Sciences, Utrecht, the Netherlands

Celia W. G. van Gelder

Amrita School of Biotechnology, Amrita Vishwa Vidyapeetham, Clappana, India

Prashanth Suravajhala

Bioclues.org, Hyderabad, India

Prashanth Suravajhala & Harpreet Singh

Department of Bioinformatics, Hans Raj Mahila Maha Vidyalaya, Jalandhar, India

Harpreet Singh

European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK

Sarah L. Morgan

Genetics Research Centre, Universitas YARSI, Jakarta, Indonesia

Hilyatuz Zahroh

School of Applied Science, Temasek Polytechnic, Singapore, Singapore

Maurice Ling

Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belvaux, Luxembourg

Venkata P. Satagopam

International Society for Computational Biology, Leesburg, VA, USA

Venkata P. Satagopam

CSIRO Data61, Brisbane, Queensland, Australia

Annette McGrath

Institute of Medical Science, University of Tokyo, Tokyo, Japan

Kenta Nakai

Department of Biochemistry, YLL School of Medicine, National University of Singapore, Singapore, Singapore

Tin Wee Tan

National Supercomputing Centre, Singapore, Singapore

Tin Wee Tan

State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center and Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, Peking University, Beijing, China

Ge Gao

Computational Biology Division, Department of Integrative Biomedical Sciences, University of Cape Town, Cape Town, South Africa

Nicola Mulder

Department of Biology, School of Sciences and Humanities, Nazarbayev University, Astana, Kazakhstan

Christian Schönbach

School of Landscape and Horticulture, Yunnan Agricultural University, Kunming, China

Yun Zheng

Cancer Research Center, Spanish National Research Council, University of Salamanca & Institute for Biomedical Research of Salamanca, Salamanca, Spain

Javier De Las Rivas

Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur, Malaysia

Asif M. Khan

Conceptualization: M.D.B., R.S., B.G., P.M.P., C.W.G.v.G., P.S., H.S., S.L.M., H.Z., M.L., A.M., K.N., T.W.T., G.G., A.M.K. Writing - original draft preparation: E.B.I., M.D.B., R.S., B.G., P.M.P., C.W.G.v.G., P.S., H.S., S.L.M., H.Z., A.M.K. Writing - review and editing: E.B.I., M.D.B., R.S., B.G., P.M.P., C.W.G.v.G., P.S., H.S., S.L.M., H.Z., M.L., V.P.S., A.M., K.N., T.W.T., G.G., N.M., C.S., Y.Z., J.D.L.R., A.M.K. Project administration: E.B.I., M.D.B., R.S., A.M.K.

Read more from the original source:

Grand challenges in bioinformatics education and training - Nature.com

Read More..

Optimizing GPT Prompts for Data Science | by Andrea Valenzuela … – Medium

DataCamp Tutorial on Prompt Engineering (self-made image: tutorial cover).

It's been a week, and at ForCodeSake we still have an emotional hangover!

Last Friday, the 28th of July, we ran our first ForCodeSake online tutorial on Prompt Engineering. The tutorial was organized by DataCamp as part of their series of webinars about GPT models.

As ForCodeSake's first online debut, we decided to show different techniques for optimizing queries when using GPT models in data science or when building LLM-powered applications. Concretely, the tutorial had three main goals:

Learn the principles of Good Prompting.

Learn how to standardize and test the quality of your prompts at scale.

Learn how to moderate AI responses to ensure quality.

Feeling like you would like to follow the tutorial too? In this short article, we aim to provide the pointers to the course's material so you can benefit from the full experience.

To follow the webinar, you need to have an active OpenAI account with access to the API and generate an OpenAI API key. No idea where to start? Then the following article is for you!
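In the meantime, a minimal way to wire the key into a script (assuming the pre-1.0 openai Python client; newer client versions differ) is:

    import os
    import openai

    # Read the key from an environment variable rather than hard-coding it
    openai.api_key = os.environ["OPENAI_API_KEY"]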

Have you ever received lackluster responses from ChatGPT?

Before solely attributing it to the model's performance, have you considered the role your prompts play in determining the quality of the outputs?

GPT models have showcased mind-blowing performance across a wide range of applications. However, the quality of the model's completion doesn't solely depend on the model itself; it also depends on the quality of the given prompt.

The secret to obtaining the best possible completion from the model lies in understanding how GPT models interpret user input and generate responses, enabling you to craft your prompt accordingly.

More:

Optimizing GPT Prompts for Data Science | by Andrea Valenzuela ... - Medium

Read More..