Category Archives: Data Mining
Core Scientific to expand Texas Bitcoin mining data center by 72MW – DatacenterDynamics
Bitcoin mining and digital infrastructure provider Core Scientific is planning a 72MW expansion of its data center in Denton, Texas.
The expansion will take place on the company's 31-acre site, originally built in 2022. The facility currently operates 125MW of Bitcoin mining capacity, expandable to 300MW at full build-out.
Other specifications of the expansion have not been disclosed, but completion is expected by Q2 2024.
Adam Sullivan, CEO at Core Scientific, said: "Owning and controlling all of our infrastructure with access to ready power gives us the strategic optionality to expand our mining capacity, deploy upgrades to our proprietary mining technology stack, reallocate miners to optimize for efficiency and even flex to alternative forms of compute when such opportunities arise."
The company added that the expansion program will deliver more than 20 additional exahash of mining capacity, at an average cost of $200,000 per megawatt.
Core Scientific has another Texas data center, located in Pecos, offering 71MW of capacity over a 100-acre site. The company says its Pecos site can scale up to 234MW.
The company has five other US data centers in North Dakota, Kentucky, Georgia, and North Carolina. Combined, Core Scientific says its facilities have a live supply of 745MW and 372MW of pipeline supply.
Uminers, another cryptomine data center provider, has announced it has broken ground on a data center in Ethiopia.
The facility is set to offer 100MW of IT capacity in its first phase, scheduled for completion in autumn 2024, and host approximately 24,000 high-performance ASIC miners, including models such as the Antminer S21, S21 Hydro, T21, and S19.
Other specifications, such as the location, have not been shared.
Batyr Hydyrov, president at Uminers, said: "The increasing complexity of mining necessitates greater investment in robust equipment. The industrial-scale mining for extracting the remaining bitcoins has never been more pertinent."
He added: "The battle for accessing sufficient electrical power is intensifying globally as the slowdown in miner sales worldwide is mainly due to scarcity of installation sites and available power capacity."
The company said Ethiopia offers many geographical benefits, including its affordable, eco-friendly hydroelectric power. Plans are also in place for Uminers to expand its footprint across Africa, the Middle East, and South America.
Headquartered in Guangzhou, China, and founded in 2017, Uminers has a total capacity of 90MW in data centers across Hong Kong, Oman, the UAE, the US, and Singapore.
Ethiopia has seen a recent spike in development, with Wingu.Africa, Red Fox, Sun Data World, and ScutiX all developing data centers in the country. Pan-African operator Raxio Data Centres also launched a new facility in Ethiopia in November last year.
Power of Data: Dive Into the Best Analytics Books of 2024 – Analytics Insight
In an age where data is everything, data analytics helps organizations improve their decisions, reveal opportunities they might otherwise miss, and identify risks they have yet to address. It gives businesses a data-driven way to understand customers' needs, adjust their marketing strategy accordingly, and improve overall performance. As a result, demand for competent analysts has risen significantly over the last couple of years. Here is a list of the best analytics books for 2024.
This book offers in-depth case studies showing how to solve a range of data problems in Python. The methods it details include loading, cleaning, merging, reshaping, and transforming data with the pandas and NumPy libraries.
This book is a step-by-step guide to data analytics. It provides a general process for analyzing a problem in whatever business the reader is engaged in. Master the fundamentals of data mining, machine learning, and, most importantly, reasoning to become a data-driven decision-maker.
The essential elements of reporting, business intelligence, data visualization, and descriptive statistics are emphasized throughout the text. The guide has two primary purposes: (1) the programs it includes can be helpful when implementing practical applications, and (2) the exercises it includes are to be completed in Python. It also highlights three basic machine learning methods: regression, classification, and clustering.
Data Analytics 101 is a starter pack for data literacy, which is the process of turning data into intelligence. The book describes how one collects and arranges data for machines so that they can interpret it as needed, mentioning crucial machine-learning algorithms such as regression, classification, and clustering, among others.
SQL for Data Analysis begins by helping readers strengthen their SQL skills, and then shows them how to use SQL within their workflow. It also covers advanced techniques that help transform data into insights, including joins, window functions, subqueries, and regular expressions.
This booklet is an essential tool for Excel users to learn how to analyze data and the components of the data structure. One of the crucial chapters is devoted to critical statistical techniques, including practical exercises with spreadsheets. The author also gives valuable advice on transitioning to the usage of Python and R tools for data analysis and hypothesis testing.
This book covers modern data analytics in Excel, moving beyond the limits of traditional data processing and expanding the ways data can be represented intuitively. The reader learns how to work with datasets using Power Query and Power Pivot for repeatable data segmentation and the creation of relational data models and analysis measures. The book also gives space to the more advanced use of AI and Python in Excel reporting.
The reader gains knowledge on how to use Excel for data analysis and relevant reporting of massive amounts of data. It streamlines the tedious process of creating reports and analysis and builds upon the models of data visualization.
This book is a hands-on experience where you learn tools for data analysis and apply them to support managerial, economic, and policy decisions. The book has been arranged by subject to include data wrangling, regression analysis, and causal analysis, and there are also real-world data case studies.
For business executives, Storytelling with Data is one of the best books for illustrating data visualization techniques. The author argues that however vast the topic, a carefully crafted presentation will leave a lasting impression on the audience.
JV video: GeologicAI rethinks mining data and cuts carbon emissions – The Northern Miner
The Bill Gates-backed firm GeologicAI has seen a rapid uptake of its technology among global mining majors, demonstrating a 5% reduction in carbon dioxide emissions at operations in Africa, the company says.
GeologicAI uses artificial intelligence (AI), new data types and superior algorithms to scan rocks and gather information for thorough mineral analyses, founder and president Grant Sanden explained during an interview.
The technology aims to handle uncertainty in resource estimation, thereby significantly improving the economics and environmental efficiency of mining operations, Sanden told The Northern Miner's western editor, Henry Lazenby, during the recent Prospectors and Developers Association of Canada's convention in Toronto.
GeologicAI in 2023 merged with another service provider, RMS (Resource Modelling Solutions), enhancing their data analytics and algorithms in mining.
Joint venture videos are paid-for content in arrangement with The Northern Miner.
Ethiopia Set to Pioneer Bitcoin Mining in Africa – Tech in Africa
Last Thursday, February 15, Ethiopian Investment Holdings (EIH), the investment arm of the Ethiopian Government, signed a memorandum of understanding with Hong Kong-based West Data Groups Center Service PLC to initiate bitcoin mining operations.
The collaboration is governed by a comprehensive agreement for a significant $250 million data mining initiative aimed at establishing state-of-the-art infrastructure for data mining and artificial intelligence training operations in Ethiopia, as stated by the EIH. Ethiopia is strategically positioning itself as a frontrunner in the data center sector in Africa, which is projected to reach $5.4 billion by 2027, according to Arizton Advisory and Intelligence.
According to Kal Kassa, CEO for Ethiopia at Hashlabs Mining, the development is aligned with the Ethiopian Government's goal of boosting economic growth through the strategic use of technology and energy resources to attract foreign investments.
The EIH has not confirmed the specifics of its bitcoin mining operations or provided any comments to other publications. However, as the project develops, we anticipate receiving more detailed information from them regarding the arrangement.
In 2022, despite the ban on crypto trading in the country, there was a significant development with the ratification of favorable data mining laws allowing for high-performance computing and data mining, encompassing bitcoin mining. This shift has attracted a surge of miners due to the comparatively positive reception towards bitcoin mining, abundant hydro-based energy sources, optimal weather conditions, and cost-efficient energy.
In 2023, Ethiopia emerged as the fourth leading destination for Bitcoin mining rigs, trailing only the USA, Hong Kong, and Asia, as reported by Bitcoin mining services company Luxor Technologies. The country has witnessed significant developments in this sector, with Russian company Bitcluster establishing a 120 MW bitcoin mining facility and Hashlabs Mining commencing the construction of bitcoin mines to cater to global clients.
According to a forecast by a senior executive at Bitmain and reported by Bloomberg, Ethiopia's energy potential could soon rival Texas's generation capacity. Currently, Texas accounts for an impressive 28.5% of the US's 40% global hash rate.
However, bitcoin miners are expressing caution regarding the future of regulation in the country. As demonstrated in other regions, bitcoin mining is not immune to changing regulatory landscapes. It is still premature to anticipate whether Ethiopia will follow the lead of Iran and Kazakhstan, altering its stance on bitcoin mining as they did when faced with competition from domestic energy demand.
In any case, the government is committed to increasing its foreign currency reserves to address its economic challenges. It sees mining as a promising investment opportunity to achieve this objective.
With the energy infrastructure still struggling to meet the demand for electricity in many African countries, bitcoin mining has emerged as an appealing solution to provide power to millions.
In Ethiopia, more than 40% of the population of approximately 120 million people lacks access to electricity. However, the country boasts an installed generation capacity of over 5,000 MW, with plans for approximately 5,150 MW of additional generation capacity upon the completion of the Grand Ethiopian Renaissance Dam (GERD), the largest hydroelectric project in Africa.
Drawing inspiration from successful mining projects in Africa like Gridless and Trojan Mining, Ethiopia has the opportunity to harness its surplus green energy to power bitcoin mining operations and provide electricity to its citizens. This pioneering initiative could set a precedent for other African nations with similar energy resources, offering a viable solution to common economic challenges.
Moreover, integrating bitcoin mining into the Ethiopian economy has the potential to contribute $2 to $4 billion to its GDP, as per data from Project Mano. This open collective aims to educate the government on the economic benefits of bitcoin for the country.
Ethiopia stands to gain immense economic advantages by strategically harnessing its abundant energy resources for Bitcoin mining. It's only a matter of time before other African nations follow suit, tapping into this lucrative opportunity.
Bitcoin mining has the potential to significantly contribute to addressing the economic challenges faced by several African countries.
10 Python Libraries for Data Cleaning and Preprocessing – Analytics Insight
In data science, data cleaning and preprocessing are key steps in preparing raw data for analysis and modeling. Python's vast ecosystem of libraries provides several tools to assist with these tasks. In this article, we'll explore the top 10 Python libraries for data cleaning and preprocessing, providing insights into their features, benefits, and recommendations for optimizing your data analysis workflow.
Pandas is a robust data manipulation library that offers high-performance, user-friendly data structures and analytical tools in Python. Pandas enables users to import, clean, transform, and analyze structured data efficiently. It offers flexible data structures such as DataFrames and Series, along with a wide range of functions for data cleaning, preprocessing, and exploration. Pandas is a versatile library that is commonly used in data science projects for tasks such as data cleaning, filtering, grouping, and visualization.
NumPy is a fundamental library for numerical computing in Python, including multidimensional arrays and matrices. It provides a diverse set of mathematical functions and operations for data manipulation, such as array manipulation, linear algebra, statistical analysis, and random number generation. NumPy's array-based computing capabilities make it ideal for data preparation tasks including data normalization, scaling, and transformation. It is a core component of the Python scientific computing ecosystem and is often used in conjunction with other libraries such as Pandas and Matplotlib.
SciPy is an open-source Python library for scientific computing that includes a variety of functions and algorithms for numerical optimization, integration, interpolation, and signal processing. It builds upon NumPy and provides additional functionality for scientific and technical computing tasks. SciPy's optimization and interpolation methods are very useful for data preparation tasks like feature engineering, dimension reduction, and data imputation. It is popularly used in data science and machine learning projects due to its extensive collection of algorithms and tools.
scikit-learn is a versatile Python machine-learning library that offers simple and efficient tools for data mining and analysis. It provides a wide range of algorithms for classification, regression, clustering, dimensionality reduction, and model selection. scikit-learn's preprocessing module includes functions for data scaling, normalization, encoding categorical variables, and handling missing values. It is widely used in data preprocessing pipelines for machine learning tasks and provides a consistent interface for building and evaluating predictive models.
TensorFlow Data Validation (TFDV) is a library for exploring and validating datasets for machine learning. It includes tools for assessing the characteristics of datasets, detecting abnormalities, and determining data quality issues. TFDV's features include schema inference, data drift detection, and anomaly detection, making it useful for data cleaning and preprocessing tasks. It is often used in conjunction with TensorFlow Extended (TFX) for building end-to-end machine learning pipelines.
Feature-Engine is a Python library that facilitates feature engineering and selection in machine learning projects. It includes a wide range of transformers for data preprocessing tasks such as handling missing values, encoding categorical variables, and scaling numerical features. Feature-Engine's transformers can be easily integrated into scikit-learn pipelines, making it a convenient tool for building data preprocessing workflows. It is designed to be fast, flexible, and easy to use, making it suitable for both beginners and experienced data scientists.
Dora is a Python library for data preprocessing and supports exploratory data analysis (EDA). It provides a set of functions and utilities for visualizing and understanding datasets, identifying patterns and trends, and preparing data for analysis. Dora's features include data cleaning, transformation, and visualization, making it a versatile tool for data preprocessing. It is developed on top of Pandas and offers an easy-to-use interface for data exploration and manipulation.
Pyjanitor is a Python library for data cleaning and preparation, inspired by the R package janitor. It includes a suite of functions and utilities for cleaning messy datasets, handling missing values, and reshaping data. Pyjanitor provides functions for renaming columns, deleting duplicates, converting data types, and performing group-wise operations. It is designed to be simple, expressive, and easy to use, making it a useful tool for data cleaning and preprocessing tasks.
Featuretools is a Python library for automated feature engineering and feature selection. It provides tools for creating new features from existing data, identifying relevant features for machine learning tasks, and building feature sets for predictive modeling. Featuretools' automated feature engineering capabilities can considerably reduce the time and effort needed for data preprocessing tasks. It is particularly useful for handling complex datasets with multiple tables and relationships.
Dask is a flexible parallel computing library for Python that provides scalable data processing capabilities. It enables users to parallelize data processing tasks across numerous cores and nodes, making it ideal for handling enormous datasets that cannot be stored in memory. Dask's DataFrame and Array data structures are compatible with Pandas and NumPy, allowing users to leverage their familiar APIs for data preprocessing tasks. It is particularly useful for distributed data preprocessing tasks in cloud computing environments.
These ten Python libraries provide powerful tools and utilities for data cleaning and preprocessing, allowing data scientists to streamline their data analysis workflow and prepare datasets for machine learning tasks. Data scientists can use these libraries to efficiently manage data cleaning, transformation, and exploration tasks, enabling them to focus on building and deploying predictive models and extracting valuable insights from their data.
EIA Enters Agreement Tied to Collection of Cryptocurrency Mining Data – American Public Power Association
The U.S. Energy Information Administration has entered into an agreement that stems from recent litigation related to EIA's plan to collect data tied to the electricity consumption associated with cryptocurrency mining activity.
EIA on Feb. 1 detailed its plans to focus on evaluating the electricity consumption associated with cryptocurrency mining activity. "Given the dynamic and rapid growth of cryptocurrency mining activity in the United States, we will be conducting a mandatory survey focused on systematically evaluating the electricity consumption associated with cryptocurrency mining activity, which is required to better inform planning decisions and educate the public," EIA said in an analysis posted on its website.
The Texas Blockchain Council, alongside one of its members, Riot Platforms, on Feb. 22 initiated legal proceedings against EIA, challenging an alleged "unprecedented and illegal data collection demand against the bitcoin mining industry."
A Texas judge on Feb. 23 granted a temporary restraining order in a proceeding involving the Energy Information Administration's recently announced plan to collect data tied to the electricity consumption associated with cryptocurrency mining activity.
Under the March 1 agreement in principle, EIA has agreed to destroy any information that it has already received in response to the EIA-862 Emergency Survey. If EIA receives additional information in response to the EIA-862 Emergency Survey, EIA will destroy that data. EIA will sequester and keep confidential any information it has received or will receive in response to the EIA-862 Emergency Survey until it is destroyed.
EIA will also publish in the Federal Register a new notice of a proposed collection of information that will withdraw and replace a February 9 Notice.
EIA will allow for submission of comments for 60 days, beginning on the date of publication of the new Federal Register notice.
GIS-based non-grain cultivated land susceptibility prediction using data mining methods | Scientific Reports – Nature.com
Research flow
The NCL susceptibility prediction study includes four main parts: (1) screening and analysis of the influencing factors of NCL; (2) construction of the NCL susceptibility prediction model; (3) NCL susceptibility prediction; and (4) evaluation of the prediction results. The research flow is shown in Fig. 2.
The NCL locations were obtained from Google Earth interpretation, field survey, and data released by the local government, yielding a total of 184 NCL locations. To determine the non-NCL locations, GIS software was applied and 184 locations were randomly selected. To reduce modeling bias, the non-NCL points were generated at a distance of 200 m from NCL locations. The data were divided into training samples and testing samples in a ratio of 7:3, forming the training dataset and the testing dataset (Fig. 3).
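The 7:3 split described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the point identifiers, the fixed seed and the `split_samples` helper are hypothetical, and the source does not say whether the authors split each class separately or used GIS tooling for this step.

```python
import random

def split_samples(samples, train_ratio=0.7, seed=42):
    """Shuffle labelled samples and split them into training and testing sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = samples[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical inventory: 184 NCL points (label 1) and 184 non-NCL points (label 0).
ncl = [(f"ncl_{i}", 1) for i in range(184)]
non_ncl = [(f"non_{i}", 0) for i in range(184)]

train, test = split_samples(ncl + non_ncl)
print(len(train), len(test))
```

Shuffling before cutting keeps the two classes mixed in both subsets; a stratified split would be the natural refinement if exact class balance in each subset matters.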
Currently, there is no consensus on the factors influencing NCL. Therefore, based on historical research materials and on-site field investigations [24-28], 16 appropriate Non-grain Cultivated Land Susceptibility conditioning factors (NCLSCFs) were chosen for modelling NCL susceptibility in accordance with topographical, geological, hydrological, climatological and environmental conditions. A systematic literature review of NCL modelling was also performed to help identify the most suitable NCLSCFs for this study. The NCLSCF maps are shown in Fig. 4.
Typical NCL factors map: (a) Slope; (b) Aspect; (c) Plan curvature; (d) Profile curvature; (e) TWI; (f) SPI; (g) Rainfall; (h) Drainage density; (i) Distance from river; (j) Lithology; (k) Fault density; (l) Distance from fault; (m) Landuse; (n) Soil; (o) Distance from road.
(1) Topographical factors
The occurrence of NCL and its recurrence frequency depend strongly on the topography of an area. Several topographical factors, such as slope, elevation and curvature, are triggering parameters for the development of NCL activities [29]. Here, six topographical factors were chosen: altitude, slope, aspect, plan curvature, profile curvature and topographic wetness index (TWI). All of these factors play a considerable part in NCL development in the study area. They were prepared in ArcGIS from Shuttle Radar Topography Mission (SRTM) digital elevation model (DEM) data with 30 m resolution. In the output maps, altitude ranges from 895 to 3289 m (Fig. 3), slope from 0 to 261.61%, aspect has nine directions (flat, north, northeast, east, southeast, south, southwest, west, northwest), plan curvature ranges from −12.59 to 13.40, profile curvature from −13.05 to 12.68, and TWI from 4.96 to 24.75. The following equation was applied to compute TWI:
$$TWI = \ln\frac{\alpha}{\tan\beta + C}$$
(1)
where \(\alpha\) specifies flow accumulation, \(\beta\) specifies slope and C is a constant (0.01).
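As a rough illustration of Eq. (1), the snippet below evaluates TWI for single raster cells. It assumes the slope is supplied as an angle in degrees and that flow accumulation is strictly positive; the function name and the sample values are hypothetical, and the authors computed the index in ArcGIS rather than in code like this.

```python
import math

def twi(flow_accumulation, slope_deg, c=0.01):
    """Topographic wetness index: ln(alpha / (tan(beta) + C))."""
    beta = math.radians(slope_deg)     # slope angle converted to radians
    return math.log(flow_accumulation / (math.tan(beta) + c))

# Hypothetical cells: same flow accumulation, gentle vs. steep slope.
gentle = twi(500.0, 2.0)
steep = twi(500.0, 30.0)
print(round(gentle, 2), round(steep, 2))
```

The constant C keeps the denominator non-zero on flat cells, and for a fixed flow accumulation the index falls as the slope steepens.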
(2) Hydrological factors
Sub-surface hydrology is treated as an activating mechanism for the occurrence of NCL, as water plays a significant part in soil moisture content. Therefore, four hydrological factors, namely drainage density, distance from river, stream power index (SPI) and annual rainfall, were chosen for modelling NCL susceptibility [30]. SRTM DEM data of 30 m spatial resolution was used to map the first three hydrological variables. The drainage density and distance from river maps were prepared using the line density and Euclidean distance tools, respectively, in a GIS platform. The following formula was applied to compute SPI:
$$SPI = A_s \times \tan\beta$$
(2)
where \(A_s\) specifies the specific catchment area in square meters and \(\beta\) specifies the slope angle in degrees. The precipitation map of the area was derived from the records of 19 climatological stations around the province over a period of 25 years, using the kriging interpolation method in a GIS platform. The output drainage density ranges from 0 to 1.68 km/km². Meanwhile, distance from river ranges between 0 and 9153.93 m, average annual rainfall varies from 175 to 459.98 mm and SPI ranges from 0 to 8.44.
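Eq. (2) is a single product, so a per-cell sketch is short. The function name and the sample catchment areas are hypothetical, and the slope is assumed to be given in degrees as stated above.

```python
import math

def spi(catchment_area_m2, slope_deg):
    """Stream power index: A_s * tan(beta)."""
    return catchment_area_m2 * math.tan(math.radians(slope_deg))

# Hypothetical cells given as (specific catchment area in m^2, slope in degrees).
cells = [(10.0, 0.0), (10.0, 30.0), (25.0, 30.0)]
values = [spi(area, slope) for area, slope in cells]
print(values)
```

A flat cell gets an index of zero, and the index grows with both the contributing area and the slope.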
(3) Geological factors
The characteristics of the rock mass, i.e., the lithological characteristics of an area, significantly affect NCL activities [31]. Geological factors are therefore commonly used as input parameters in NCL susceptibility studies to optimize NCL prediction. In the current study, three geological factors (lithology, fault density and distance from fault) were chosen. The lithological map and fault lines were taken from the geological map of the study area, obtained from the local government at a scale of 1:100,000. The fault density and distance from fault maps were prepared using the line density and Euclidean distance tools, respectively, in a GIS platform. In this area, fault density varies from 0 to 0.54 km/km² and distance from fault ranges from 0 to 28,247.1 m. The lithological map of the area is presented in Fig. 4b.
(4) Environmental factors
Several environmental factors can also be significant triggering factors for NCL occurrence in mountainous or hilly regions [32]. Here, land use land cover (LULC), soil and distance from road were selected as environmental variables for predicting NCL susceptibility. The LULC map was derived from Landsat 8 OLI satellite images by applying the maximum likelihood algorithm in ENVI. The soil texture map was prepared from the soil map of the study area. The road map of the area was digitized from the topographical map provided by the local government. The output LULC factor was classified into six land use classes, the soil map was classified into eight soil texture groups, and distance from road ranges from 0 to 31,248.1 m.
Because the NCLSCFs are selected manually, and their dimensions and quantification methods are derived through mathematical operations, multicollinearity may exist among them when they are used as model inputs [33]. Such problems arise from exact or highly correlated relationships between NCLSCFs and can lead to model distortion or difficulty in estimation. To avoid this, the study examines the variance inflation factor (VIF) and the tolerance (TOL) index to assess whether multicollinearity exists among the NCLSCFs.
The multicollinearity (MC) analysis was conducted among the chosen NCLSCFs to optimize the NCL susceptibility model and its predictions [34]. The TOL and VIF statistics were used to test for MC in SPSS. Studies indicate that there is a multicollinearity issue if the VIF value is > 5 and the TOL value is < 0.10. TOL and VIF were computed with the following formulas:
$$TOL=1-{R}_{j}^{2}$$
(3)
$$VIF=\frac{1}{TOL}$$
(4)
where \(R_j^2\) is the coefficient of determination obtained by regressing factor j on all the other factors.
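To make Eqs. (3) and (4) concrete, the sketch below regresses each factor on the others by ordinary least squares and reports TOL and VIF. It is a standard-library illustration only: the authors ran this test in SPSS, the helper names are hypothetical, and the six sample rows are invented values rather than study data.

```python
def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors, with an intercept."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    coef = solve(xtx, xty)                              # normal equations
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def tol_vif(factors):
    """Return {name: (TOL, VIF)} for a dict of equally long factor columns."""
    result = {}
    for name, column in factors.items():
        others = [v for k, v in factors.items() if k != name]
        tol = 1.0 - r_squared(column, others)
        result[name] = (tol, 1.0 / tol)
    return result

# Hypothetical factor values at six sample points (not taken from the study).
factors = {
    "slope":    [5.0, 12.0, 20.0, 8.0, 30.0, 15.0],
    "twi":      [9.1, 7.4, 6.0, 8.5, 5.2, 7.0],
    "rainfall": [310.0, 180.0, 420.0, 250.0, 390.0, 205.0],
}
for name, (tol, vif) in tol_vif(factors).items():
    print(f"{name}: TOL={tol:.3f}, VIF={vif:.2f}")
```

In this invented sample, slope and TWI are strongly related, so both exceed the VIF > 5 threshold quoted above and one of them would be a candidate for removal. A perfectly collinear factor would give TOL = 0 and needs to be handled before dividing.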
This section details the gradient boosting machine (GBM) and extreme gradient boosting (XGB) machine learning models, as used in NCL susceptibility studies.
GBM is one of the most popular machine learning methods for prediction; it is widely applied by researchers in different fields and is treated as a supervised classification technique. The GBM method, first proposed by Friedman [35], is also often used to solve a variety of classification and regression problems. The model is an ensemble of weak prediction models such as decision trees, and is therefore considered one of the most important prediction models. A GBM model requires three components: a loss function, a weak learner, and an additive model in which weak learners are added to optimize the loss function. In addition to these components, three important tuning parameters are required to build a GBM model [36]: n-tree (the maximum number of trees), tree depth (the highest possible interaction among the independent variables) and shrinkage (the learning rate). The advantage of such a model is that the loss function and weak learners can be specified precisely. Obtaining the optimal estimate directly for an arbitrary loss function \(\psi(y, f)\) and weak learner \(h(x, \theta)\) is difficult. To solve this problem, a new function \(h(x, \theta_t)\) is chosen to be most parallel to the negative gradient \(\{g_t(x_i)\}_{i=1}^{N}\) along the observed data:
$$g_t(x) = E_y\left[\frac{\partial \psi(y, f(x))}{\partial f(x)} \,\middle|\, x\right]_{f(x) = f^{t-1}(x)}$$
(5)
This new function is the one most highly correlated with \(-g_t(x)\). The algorithm therefore allows the optimization to be replaced by a least-squares minimization, applying the following equation:
$$(\rho_t, \theta_t) = \arg\min_{\rho, \theta} \sum_{i=1}^{N} \left[-g_t(x_i) + \rho h(x_i, \theta)\right]^2$$
(6)
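The procedure in Eqs. (5) and (6) can be illustrated with a deliberately small, standard-library sketch. It assumes a squared-error loss, for which the negative gradient is simply the residual, and uses one-split regression stumps as weak learners (tree depth fixed at 1). The step size \(\rho\) is replaced by a constant shrinkage factor, and all names and data are hypothetical; this is not the authors' implementation.

```python
def fit_stump(x, target):
    """Least-squares regression stump: best threshold and the two leaf means."""
    best = None
    for threshold in sorted(set(x))[:-1]:      # skip the max so both sides are non-empty
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda xi: lm if xi <= threshold else rm

def gradient_boost(x, y, n_trees=50, shrinkage=0.1):
    """Boost stumps under squared loss; each stump is fitted to the negative gradient."""
    f0 = sum(y) / len(y)                       # constant initial model
    learners = []
    pred = [f0] * len(y)
    for _ in range(n_trees):
        neg_grad = [yi - pi for yi, pi in zip(y, pred)]   # -g_t(x_i) for squared loss
        h = fit_stump(x, neg_grad)
        learners.append(h)
        pred = [pi + shrinkage * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + shrinkage * sum(h(xi) for h in learners)

def mse(model, x, y):
    """Mean squared error of a model on the given data."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

# Hypothetical 1-D data with a step-like response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.1, 0.0, 0.2, 0.1, 0.9, 1.0, 0.8, 1.1]

baseline = lambda xi: sum(y) / len(y)          # predicts the mean everywhere
model = gradient_boost(x, y)
print(mse(baseline, x, y), mse(model, x, y))
```

Here `n_trees` and `shrinkage` play the roles of the n-tree and shrinkage tuning parameters mentioned above. A smaller shrinkage generally needs more trees to reach the same training error.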
Chen & Guestrin then went on to introduce the XGB algorithm, an advanced machine learning method that is more efficient than the others [37]. The XGB algorithm is based on classification trees and the gradient boosting framework, which an XGB model applies through parallel tree boosting. The algorithm is chiefly used to boost the performance of different classification trees. A classification tree is usually made up of various rules that classify each input factor as a function of the predictor variables in a graph structure. This graph is built as an individual tree, and its leaves are assigned scores that convey and select the respective factor class, i.e., categorical or ordinal. A regularized loss function is used in the XGB algorithm to train the ensemble model; the regularization term deals specifically with the complexity of the trees [38]. This regularization can significantly enhance prediction performance by alleviating over-fitting. The boosting method, which combines weak learners, is used in the XGB algorithm to predict the result optimally. Three groups of parameters (general, task and booster) are used to configure XGB models. The weighted averages of several tree models are then combined to form the output of XGB. The following optimization function was applied to form the XGBoost model:
$$OF(\theta) = \sum_{i=1}^{n} l\left(y_i, \bar{y}_i\right) + \sum_{k=1}^{K} \omega(f_k)$$
(7)
where \(\sum_{i=1}^{n} l\left(y_i, \overline{y}_i\right)\) is the loss function over the training dataset, \(\sum_{k=1}^{K} \omega(f_k)\) is the regularization term that controls over-fitting, \(K\) indicates the number of individual trees, \(f_k\) is the \(k\)th tree of the ensemble, and \(\overline{y}_i\) and \(y_i\) indicate the actual and predicted output variables respectively.
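The two terms of Eq. 7 can be evaluated with a short sketch. The text does not spell out the penalty \(\omega(f_k)\); the sketch assumes the form given in the XGBoost paper, \(\gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2\), with \(T\) leaves and leaf weights \(w_j\), and it assumes a squared-error loss. The function names and default values are illustrative only.

```python
from typing import Sequence


def squared_error(y_true: float, y_pred: float) -> float:
    """Per-sample loss l(y_i, y_bar_i); squared error is an assumption here."""
    return (y_true - y_pred) ** 2


def tree_penalty(leaf_weights: Sequence[float], gamma: float, lam: float) -> float:
    """omega(f) = gamma * T + 0.5 * lam * sum(w_j ** 2), with T leaves."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true: Sequence[float], y_pred: Sequence[float],
              trees: Sequence[Sequence[float]],
              gamma: float = 1.0, lam: float = 1.0) -> float:
    """Training loss plus the complexity penalty summed over all K trees."""
    loss = sum(squared_error(a, p) for a, p in zip(y_true, y_pred))
    penalty = sum(tree_penalty(t, gamma, lam) for t in trees)
    return loss + penalty
```

Each tree is represented only by its list of leaf weights, which is all the penalty needs; raising `gamma` or `lam` makes larger trees more costly and so discourages over-fitting.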
Kennedy, an American social psychologist, developed the PSO algorithm based on the food-seeking and feeding behavior of flocks of birds39. It is a meta-heuristic simulation of a social model, often applied in behavioral studies of fish schooling, bird flocking and swarming theory. The PSO method can be applied to solve the non-linear problems that arise in day-to-day research. The PSO algorithm has been widely applied to determine the best achievable direction in which to collect food, drawing specifically on the collective intelligence of birds and fish. Here, birds are treated as particles, and they always search for an optimal solution to the problem. In this model, each bird is considered an individual and the swarm is treated as a group, as in other evolutionary algorithms. The particles always try to locate the best possible solution to a given problem in an n-dimensional space, where n indicates the number of parameters of the problem40. PSO rests on two fundamental quantities, position and speed, which govern the movement of each particle.
Hence, \(x_i^t = (x_{i1}^t, x_{i2}^t, \ldots, x_{in}^t)\) and \(v_i^t = (v_{i1}^t, v_{i2}^t, \ldots, v_{in}^t)\) are the position and speed of the \(i\)th particle in the \(t\)th iteration. The following formulas give the position and speed of the \(i\)th particle in the \((t+1)\)th iteration.
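For reference, the classic PSO update rule can be written as follows. This is the textbook form and an assumption about what the authors used; their exact notation may differ.

$$v_i^{t+1} = \omega\, v_i^t + c_1 r_1 \left(p_i^t - x_i^t\right) + c_2 r_2 \left(g^t - x_i^t\right)$$

$$x_i^{t+1} = x_i^t + v_i^{t+1}$$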
Where \(x_i^t\) is the previous position of the \(i\)th particle; \(p_i^t\) is the best position found so far by that particle; \(g^t\) is the best position found so far by the swarm; \(r_1\) and \(r_2\) are random numbers between 0 and 1; \(\omega\) is the inertia weight; \(c_1\) is the cognitive coefficient and \(c_2\) is the social coefficient. Several methods have been proposed for assigning the particles' weights. Among them, standard 2011 PSO is the most popular and has been widely used by previous researchers. Here, standard 2011 PSO was used to assign the particles' weights using the following formula:
$$\omega = \frac{1}{2\ln 2} \quad \text{and} \quad c_1 = c_2 = 0.5 + \ln 2$$
(8)
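A minimal sketch of a particle swarm using the constants of Eq. 8 is given below. It assumes the classic global-best velocity update and takes only the weights from standard 2011 PSO, which may differ in other details; the `sphere` test function, the search bounds and the swarm size are illustrative choices, not values from the study.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

OMEGA = 1.0 / (2.0 * math.log(2.0))  # inertia weight from Eq. 8, about 0.721
C1 = C2 = 0.5 + math.log(2.0)        # cognitive and social coefficients, about 1.193


def sphere(z: Sequence[float]) -> float:
    """Toy objective with its minimum of 0 at the origin."""
    return sum(c * c for c in z)


def pso(objective: Callable[[Sequence[float]], float], dim: int,
        n_particles: int = 20, n_iter: int = 100,
        bounds: Tuple[float, float] = (-5.0, 5.0),
        seed: int = 0) -> Tuple[List[float], float]:
    """Minimize `objective` and return the best position and its value."""
    rng = random.Random(seed)
    lo, hi = bounds
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p = [xi[:] for xi in x]                  # personal best positions
    p_val = [objective(xi) for xi in x]
    g_idx = min(range(n_particles), key=p_val.__getitem__)
    g, g_val = p[g_idx][:], p_val[g_idx]     # best position found by the swarm
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term, pull toward personal best, pull toward swarm best.
                v[i][d] = (OMEGA * v[i][d]
                           + C1 * r1 * (p[i][d] - x[i][d])
                           + C2 * r2 * (g[d] - x[i][d]))
                x[i][d] += v[i][d]
            val = objective(x[i])
            if val < p_val[i]:
                p[i], p_val[i] = x[i][:], val
                if val < g_val:
                    g, g_val = x[i][:], val
    return g, g_val
```

Calling `pso(sphere, dim=2)` returns a position close to the origin; in a modelling workflow the objective would instead be a validation error evaluated at a candidate set of hyperparameters.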
Evaluation is an important step for quantifying the accuracy of each model output. In other words, the quality of the output model is established through a validation assessment41. Studies indicate that several statistical techniques can be applied to evaluate the accuracy of the algorithms; among them, the most frequently used is the receiver operating characteristic-area under curve (ROC-AUC). Here, sensitivity (SST), specificity (SPF), positive predictive value (PPV), negative predictive value (NPV) and ROC-AUC were all applied to validate and assess the accuracy of the models. These statistics were computed from four indices, i.e., true positive (TP), true negative (TN), false positive (FP) and false negative (FN)42. Correctly and incorrectly identified NCL susceptibility zones are represented by TP and FP, and correctly and incorrectly identified non-NCL susceptibility zones are represented by TN and FN respectively. The ROC is widely used as a standard means of evaluating the accuracy of the methods. It is based on event and non-event phenomena. For all of these techniques, a higher value represents good model performance, and a lower value represents poor performance. The statistical techniques applied in this study were computed with the following formulas:
$$\mathrm{SST} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
(9)
$$\mathrm{SPF} = \frac{\mathrm{TN}}{\mathrm{FP} + \mathrm{TN}}$$
(10)
$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{FP} + \mathrm{TP}}$$
(11)
$$\mathrm{NPV} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}}$$
(12)
$$\mathrm{AUC} = \frac{\Sigma \mathrm{TP} + \Sigma \mathrm{TN}}{\mathrm{P} + \mathrm{N}}$$
(13)
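The validation statistics can be computed directly from the four confusion-matrix counts, as in the sketch below. NPV is computed with its standard definition, TN / (TN + FN). The ratio of Eq. 13 is implemented as written, with P + N taken as the total number of samples; the same ratio is commonly called overall accuracy, whereas an area under a ROC curve is usually obtained from ranked prediction scores. The function name and dictionary keys are illustrative.

```python
from typing import Dict


def validation_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Validation statistics from true/false positive and negative counts."""
    total = tp + tn + fp + fn  # P + N: every positive and negative sample
    return {
        "SST": tp / (tp + fn),       # sensitivity, Eq. 9
        "SPF": tn / (fp + tn),       # specificity, Eq. 10
        "PPV": tp / (fp + tp),       # positive predictive value, Eq. 11
        "NPV": tn / (tn + fn),       # negative predictive value, Eq. 12
        "EQ13": (tp + tn) / total,   # ratio of Eq. 13
    }
```

For example, `validation_metrics(tp=40, tn=45, fp=5, fn=10)` gives a sensitivity of 0.8 and a specificity of 0.9.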
Data mining the archives | Opinion – Chemistry World
History, including the history of science, has a narrative tradition. Even if the historian's research has involved a dive into archival material such as demographic statistics or political budgets to find quantitative support for a thesis, the stories it tells are best expressed in words, not graphs. Typically, any mathematics it requires would hardly tax an able school student.
But there are some aspects of history that only a sophisticated analysis of quantitative data can reveal. That was made clear in a 2019 study by researchers in Leipzig, Germany,1 who used the Reaxys database of chemical compounds to analyse the growth in the number of substances documented in scientific journals between 1800 and 2015. They found that this number has grown exponentially, with an annual rate of 4.4% on average.
And by inspecting the products made, the researchers identified three regimes, which they call proto-organic (before 1861), organic (1861 to 1980) and organometallic (from 1981). Each of these periods is characterised by a change (a progressive decrease) in the variability or volatility of the annual figures.
There's more that can be gleaned from those data, but the key points are twofold. First, while the conclusions might seem retrospectively consistent with what one might expect, only precise quantification, not anecdotal inspection of the literature, could reveal them. It is almost as if all the advances in both theory (the emergence of structural theory and of the quantum description of the chemical bond, say) and in techniques don't matter so much in the end to what chemists make, or at least to their productivity in making. (Perhaps unsurprisingly, the two world wars mattered more to that, albeit transiently.)
Second, chemistry might be uniquely favoured among the sciences for this sort of quantitative study. It is hard to imagine any comparable index to gauge the progress of physics or biology. The expansion of known chemical space is arguably a crude measure of what it is that chemists do and know, but it surely counts for something. And as Guillermo Restrepo, one of the 2019 study's authors and an organiser of a recent meeting at the Max Planck Institute for Mathematics in the Sciences in Leipzig on quantitative approaches to the history of chemistry, says, the existence of such a measure speaks to the unusual ontological stability of chemistry: since John Dalton's atomic theory at the start of the 19th century, it has been consistently predicated on the idea that chemical compounds are combinations of atomic elemental constituents.
Still, there are other ways to mine historical evidence for quantitative insights into the history of science, often now aided by AI techniques. Matteo Valleriani of the Max Planck Institute for the History of Science in Berlin, Germany, and his colleagues have used such methods to compare the texts of printed Renaissance books that used parts of the treatise on astronomy by the 13th-century scholar Johannes de Sacrobosco. The study elucidated how relationships between publishers, and the sheer mechanics of the printing process (where old plates might be reused for convenience), influenced the spread and the nature of scientific knowledge in this period.
And by using computer-assisted linguistic analysis of texts in the Philosophical Transactions of the Royal Society in the 18th and 19th centuries, Stefania Degaetano-Ortlieb of Saarland University in Germany and colleagues have identified the impact of Antoine Lavoisier's new chemical terminology from around the 1790s. This amounts to more than seeing new words appear in the lexicon: the statistics of word frequencies and placings disclose the evolving norms and expectations of the scientific community. At the other end of the historical trajectory, an analysis of the recent chemical literature by Marisol Bermúdez-Montaña of Tecnológico de Monterrey in Mexico reveals the dramatic hegemony of China in the study of rare-earth chemistry since around 2003.
All this work depends on accessibility of archival data, and it was a common refrain at the meeting that this can't be taken for granted. As historian of science Jeffrey Johnson of Villanova University in Pennsylvania, US, pointed out at the meeting, there is a private chemical space explored by companies who keep their results (including negative findings) proprietary. And researchers studying the history of Russian and Soviet chemistry have, for obvious geopolitical reasons, had to shift their efforts elsewhere, and for who knows how long?
But even seemingly minor changes to archives might matter to historians: Robin Hendry of Durham University in the UK mentioned how the university library's understandable decision to throw out paper copies of old journals that are available online obliterates tell-tale clues for historians of which pages were well-thumbed. The recent cyberattacks on the British Library remind us of the vulnerability of digitised records. We can't take it for granted that the digital age will have the longevity or the information content of the paper age.
Top 14 Data Mining Tools You Need to Know in 2024 and Why – Simplilearn
Driven by the proliferation of internet-connected sensors and devices, the world today is producing data at an unprecedented pace. While one part of the globe is sleeping, the other part is beginning its day with Skype meetings, web searches, online shopping, and social media interactions. This means that data generation, on a global scale, is a never-ceasing process.
A report published by cloud software company DOMO quantifies the amount of data that the virtual world generates per minute, and the figures are striking. According to DOMO's study, each minute, the Internet population posts 511,200 tweets, watches 4,500,000 YouTube videos, creates 277,777 Instagram stories, sends 4,800,000 gifs, takes 9,772 Uber rides, makes 231,840 Skype calls, and transfers more than 162,037 payments via the mobile payment app Venmo.
With such massive volumes of digital data being captured every minute, most forward-looking organizations are keen to leverage advanced methodologies to extract critical insights from data, which facilitates better-informed decisions that boost profits. This is where data mining tools and technologies come into play.
Data mining involves a range of methods and approaches to analyze large sets of data to extract business insights. Data mining starts soon after the collection of data in data warehouses, and it covers everything from the cleansing of data to creating a visualization of the discoveries gained from the data.
Also known as "Knowledge Discovery," data mining typically refers to in-depth analysis of vast datasets that exist in varied emerging domains, such as Artificial Intelligence, Big Data, and Machine Learning. The process searches for trends, patterns, associations, and anomalies in data that enable enterprises to streamline operations, augment customer experiences, predict the future, and create more value.
The key stages involved in data mining include:
Data scientists employ a variety of data mining tools and techniques for different types of data mining tasks, such as cleaning, organizing, structuring, analyzing, and visualizing data. Here's a list of both paid and open-source data mining tools you should know about in 2024.
One of the best open-source data mining tools on the market, Apache Mahout, developed by the Apache Software Foundation, primarily focuses on collaborative filtering, clustering, and classification of data. Written in Java, an object-oriented, class-based programming language, Apache Mahout incorporates useful Java libraries that help data professionals perform diverse mathematical operations, including statistics and linear algebra.
The top features of Apache Mahout are:
Dundas BI is one of the most comprehensive data mining tools used to generate quick insights and facilitate rapid integrations. The high-caliber data mining software leverages relational data mining methods, and it places more emphasis on developing clearly-defined data structures that simplify the processing, analysis, and reporting of data.
Key features of Dundas BI include:
Teradata, also known as the Teradata Database, is a top-rated data mining tool that features an enterprise-grade data warehouse for seamless data management and data mining. The market-leading data mining software, which can differentiate between "cold" and "hot" data, is predominantly used to get insights into business-critical data related to customer preferences, product positioning, and sales.
The main attributes of Teradata are:
The SAS Data Mining Tool is a software application developed by the Statistical Analysis System (SAS) Institute for high-level data mining, analysis, and data management. Ideal for text mining and optimization, the widely-adopted tool can mine data, manage data, and do statistical analysis to provide users with accurate insights that facilitate timely and informed decision-making.
Some of the core features of the SAS Data Mining Tool include:
The SPSS Modeler software suite was originally owned by SPSS Inc. but was later acquired by the International Business Machines Corporation (IBM). The SPSS software, which is now an IBM product, allows users to use data mining algorithms to develop predictive models without any programming. The popular data mining tool is available in two editions: IBM SPSS Modeler Professional and IBM SPSS Modeler Premium, the latter incorporating additional features for entity analytics and text analytics.
The primary features of IBM SPSS Modeler are:
One of the most well-known open-source data mining tools written in Java, DataMelt integrates a state-of-the-art visualization and computational platform that makes data mining easy. The all-in-one DataMelt tool, integrating robust mathematical and scientific libraries, is mainly used for statistical analysis and data visualization in domains dealing with massive data volumes, such as financial markets.
The most prominent DataMelt features include:
A GUI-based, open-source data mining tool, Rattle leverages the R programming language's powerful statistical computing abilities to deliver valuable, actionable insights. With Rattle's built-in log code tab, users can reproduce the code generated for each GUI activity, review it, and extend it without any restrictions.
Key features of the Rattle data mining tool include:
One of the most-trusted data mining tools on the market, Oracle's data mining platform is powered by the Oracle database. It provides data analysts with top-notch algorithms for specialized analytics, data classification, prediction, and regression, enabling them to uncover insightful data patterns that help make better market predictions, detect fraud, and identify cross-selling opportunities.
The main strengths of Oracle's data mining tool are:
Fit for both small and large enterprises, Sisense allows data analysts to combine data from multiple sources to develop a repository. The first-rate data mining tool incorporates widgets as well as drag and drop features, which streamline the process of refining and analyzing data. Users can select different widgets to quickly generate reports in a variety of formats, including line charts, bar graphs, and pie charts.
Highlights of the Sisense data mining tool are:
RapidMiner stands out as a robust and flexible data science platform, offering a unified space for data preparation, machine learning, deep learning, text mining, and predictive analytics. Catering to both technical experts and novices, it features a user-friendly visual interface that simplifies the creation of analytical processes, eliminating the need for in-depth programming skills.
Key features of RapidMiner include:
KNIME (Konstanz Information Miner) is an open-source data analytics, reporting, and integration platform allowing users to create data flows visually, selectively execute some or all analysis steps, and inspect the results through interactive views and models. KNIME is particularly noted for its ability to incorporate various components for machine learning and data mining through its modular data pipelining concept.
Key features include:
Orange is a comprehensive toolkit for data visualization, machine learning, and data mining, available as open-source software. It showcases a user-friendly visual programming interface that facilitates quick, exploratory, and qualitative data analysis along with dynamic data visualization. Tailored to be user-friendly for beginners while robust enough for experts, Orange democratizes data analysis, making it more accessible to everyone.
Key features of Orange include:
H2O is a scalable, open-source platform for machine learning and predictive analytics designed to operate in memory and across distributed systems. It enables the construction of machine learning models on vast datasets, along with straightforward deployment of those models within an enterprise setting. While H2O's foundational codebase is Java, it offers accessibility through APIs in Python, R, and Scala, catering to various developers and data scientists.
Key features include:
Zoho Analytics offers a user-friendly BI and data analytics platform that empowers you to craft visually stunning data visualizations and comprehensive dashboards quickly. Tailored for businesses big and small, it simplifies the process of data analysis, allowing users to effortlessly generate reports and dashboards.
Key features include:
The demand for data professionals who know how to mine data is on the rise. On the one hand, there is an abundance of job opportunities and, on the other, a severe talent shortage. To make the most of this situation, gain the right skills, and get certified by an industry-recognized institution like Simplilearn.
Simplilearn, the leading online bootcamp and certification course provider, has partnered with Caltech and IBM to bring you the Post Graduate Program In Data Science, designed to transform you into a data scientist in just twelve months.
Ranked number-one by the Economic Times, Simplilearn's Data Science Program covers in great detail the most in-demand skills related to data mining and data analytics, such as machine learning algorithms, data visualization, NLP concepts, Tableau, R, and Python, via interactive learning models, hands-on training, and industry projects.
Read the original here:
Top 14 Data Mining Tools You Need to Know in 2024 and Why - Simplilearn
Ethiopia to start mining Bitcoin through new data mining partnership – CryptoSlate
The Ethiopian government is set to begin mining Bitcoin through a new partnership with Data Center Service, a subsidiary of West Data Group, according to Ethiopia-based Hashlabs Mining CEO Kal Kassa.
The partnership was announced by the country's sovereign wealth fund, Ethiopian Investment Holdings (EIH), on Feb. 15.
Under the collaboration, the sovereign wealth fund will invest $250 million in establishing cutting-edge infrastructure for data mining and artificial intelligence (AI) training operations in Ethiopia.
Kassa said the deal includes setting up Bitcoin mining operations using Canaan Avalon miners and is part of the country's broader strategy to leverage its technological and energy resources to attract international investment and foster economic growth.
However, the government has yet to confirm the news officially. EIH did not respond to a request for comment as of press time.
The news comes amid a spike in miner activity due to the impending halving, which is less than 65 days away and set to reduce mining rewards by 50%. Many miners have already begun expansion efforts to position themselves appropriately.
The venture is not without its challenges and controversies, particularly concerning the energy-intensive nature of Bitcoin mining.
There's an ongoing debate about the impact of such operations on local electricity supply, especially in a country where energy access remains a pressing issue for a significant portion of the population.
Despite these concerns, the Ethiopian government's move towards regulating cryptographic products, including mining, reflects a cautious yet optimistic approach to embracing the potential economic benefits of Bitcoin mining.
This regulatory framework aims to ensure that the sector's growth does not come at the expense of the country's energy security or environmental commitments.
The new rules have paved the way for mining companies to set up shop in the country. Recent media reports revealed a significant increase in Chinese miners moving to the country as part of the BRICS movement.
There has been a notable influx of Chinese miners in Ethiopia over the past few months, drawn by the country's strategic initiatives and favorable conditions.
The trend is part of a larger movement that has seen Chinese Bitcoin mining operations relocate in response to regulatory pressures at home and the search for cost-effective, regulatory-friendly environments abroad.
Ethiopia's low electricity costs, primarily due to the Grand Ethiopian Renaissance Dam, represent a primary lure for Chinese miners. This factor, coupled with the Ethiopian government's openness to technological investments and its efforts to foster a conducive environment for high-performance computing and data mining, has made the country an attractive destination for these operations.
The dam's role in providing affordable, renewable energy aligns with the miners' needs for sustainable and economically viable power sources for their energy-intensive operations.
The arrival of Chinese miners is underpinned by broader geopolitical and economic considerations. China's increasing involvement in Ethiopia, characterized by significant investments across various sectors, has established a solid foundation for such ventures.
The relationship is further reinforced by Ethiopia's strategic importance to China as a partner in Africa, offering Chinese companies a hospitable environment for expanding their operations, including Bitcoin mining.
Link:
Ethiopia to start mining Bitcoin through new data mining partnership - CryptoSlate