Enhancing human mobility research with open and standardized datasets – Nature.com

Human mobility datasets are produced from raw geolocation data through a series of pre-processing steps, the details of which are often not disclosed to those outside the research team that conducted the analysis, as shown in Fig. 1. Pre-processing steps are conducted (1) to de-noise the data and remove GPS drifts, (2) to correct for any potential biases in the mobility data, (3) to enrich the data's semantic information, and (4) to comply with privacy standards. Location data, often originally collected for marketing and business rather than research purposes, typically contain various biases. These include, but are not limited to, demographic biases (such as age, income and race), geographic biases (such as urban versus rural areas, developed versus developing countries), and behavioral biases, where observations may be more frequent during certain activities, such as checking into points of interest (POIs)11.
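The de-noising step (1) can be illustrated with a minimal sketch. One common heuristic is to discard pings that would imply a physically implausible travel speed relative to the last accepted ping; the `Ping` structure, the function names, and the 50 m/s threshold below are illustrative assumptions, not details taken from any specific data provider's pipeline.

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt


@dataclass
class Ping:
    lat: float
    lon: float
    t: float  # Unix timestamp in seconds


def haversine_m(a: Ping, b: Ping) -> float:
    """Great-circle distance in metres between two pings."""
    R = 6_371_000.0  # mean Earth radius in metres
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(h))


def drop_gps_drift(pings: list[Ping], max_speed_ms: float = 50.0) -> list[Ping]:
    """Remove pings implying an implausible speed relative to the
    last accepted ping -- a simple de-noising heuristic."""
    if not pings:
        return []
    kept = [pings[0]]
    for p in pings[1:]:
        dt = p.t - kept[-1].t
        if dt <= 0:
            continue  # drop duplicate or out-of-order timestamps
        if haversine_m(kept[-1], p) / dt <= max_speed_ms:
            kept.append(p)
    return kept
```

Even in this toy version, the speed threshold is itself a hyperparameter: a value tuned for pedestrian data would wrongly discard legitimate air travel, which is exactly the kind of undisclosed decision the article highlights.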

Fig. 1: We argue for the need for fit-for-purpose and standardized human mobility benchmark datasets for reproducible, fair and inclusive human mobility research. CBG, census block group; POI, point of interest.

Moreover, to enrich the data's semantic information for further analyses, various pre-processing steps are applied to the dataset, including user cutoff and selection, stop detection, privacy enhancement, attribution of points of interest and other contexts, and transport mode estimation. Each of these steps requires the selection of multiple parameters by the data analyst. For example, to detect a stop within a mobility trajectory, data scientists need to define arbitrary hyperparameters such as the minimum number of minutes spent at the stop and the maximum movement distance allowed from the stop centroid. With several hyperparameters needed for each pre-processing step, a slight change in the selection of these parameters could result in a very different processed human mobility dataset.
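The stop-detection example above can be sketched as a greedy sequential clustering, in the spirit of widely used stay-point detection algorithms. The function name and the default thresholds (200 m maximum radius, 10-minute minimum dwell) are illustrative choices, not the article's prescription; changing either value changes which stops are detected, which is precisely the sensitivity being described.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    R = 6_371_000.0
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(h))


def detect_stops(pings, max_radius_m=200.0, min_duration_s=600.0):
    """pings: list of (lat, lon, t) tuples sorted by time t.
    Grow a candidate cluster while each new ping stays within
    max_radius_m of the cluster centroid; emit the cluster as a stop
    (centroid_lat, centroid_lon, t_start, t_end) if the dwell time
    reaches min_duration_s."""
    stops, cluster = [], []

    def centroid(c):
        return (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))

    def flush():
        if cluster and cluster[-1][2] - cluster[0][2] >= min_duration_s:
            lat, lon = centroid(cluster)
            stops.append((lat, lon, cluster[0][2], cluster[-1][2]))

    for p in pings:
        if cluster:
            lat, lon = centroid(cluster)
            if haversine_m(lat, lon, p[0], p[1]) > max_radius_m:
                flush()
                cluster = []
        cluster.append(p)
    flush()
    return stops
```

Halving `min_duration_s` would turn brief waits (for example, at a traffic light) into "stops", while doubling `max_radius_m` could merge two nearby POI visits into one, so two analysts with reasonable but different settings produce materially different processed datasets.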

The complexity of human mobility data processing makes it difficult for data users, including researchers and analysts, to keep track of all of the decisions made during pre-processing. Moreover, owing to the proprietary nature of raw and processed human mobility datasets, disclosing the details of the pre-processing methods may not be sufficient to grasp the full characteristics of the human mobility data with which the downstream tasks were conducted. This lack of transparency about the quality of processed human mobility datasets raises critical issues in human mobility research, including the lack of replicability, generalizability, and comparability of method performance. Researchers may claim state-of-the-art prediction results on specific datasets, potentially leading to overfitting and a loss of generalizability. To address this lack of transparency regarding the validity of mobility data, several data companies (such as Unacast, Safegraph, and Cuebiq), as well as scientific papers12, have evaluated and reported the accuracy of human mobility datasets through comparisons with available external data, such as the American Community Survey and visitation patterns to stadiums and factory facilities.
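One simple way such validation against external data can be quantified is a correlation between visit counts derived from the mobility dataset and counts from a reference source. The sketch below uses a hand-rolled Pearson correlation and purely hypothetical numbers; it is not a reproduction of any vendor's or paper's actual validation procedure.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Hypothetical example: device-derived visit counts for five venues
# versus counts from an external reference (e.g., turnstile or survey data).
observed = [120, 80, 200, 45, 160]   # visits seen in the mobility dataset
reference = [1300, 900, 2100, 600, 1700]  # externally reported visits

r = pearson_r(observed, reference)
# A value near 1 indicates the dataset preserves relative visitation
# patterns even if absolute counts are undersampled.
```

Note that a high correlation on aggregate visitation does not rule out the demographic or geographic biases discussed earlier, which is why benchmark datasets with documented provenance remain necessary.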

