Chapter 3. Geospatial Databases and Data Mining
Geospatial Databases and Data Mining
Spatiotemporal data, dynamic data, and location-aware computing present important opportunities for research in the geospatial database and data mining arenas. Current database techniques use very simple representations of geographic objects and relationships (e.g., point objects, polygons, and Euclidean distances). Data structures, queries, indexes, and algorithms need to be expanded to handle other geographic objects (e.g., objects that move and evolve over time) and relationships (e.g., non-Euclidean distances, direction, and connectivity) (Miller and Han, 2001). One of the most serious challenges is integrating time into database representations. Another is integrating geospatial data sets from multiple sources (often with varied formats, semantics, precision, coordinate systems, and so forth).
Data mining is an iterative process that attempts to extract from data useful information, patterns, and trends that were previously unknown. Although data mining is a relatively new area of research, its roots lie in several more established disciplines, including database management, machine learning, statistics, high-performance computing, and information retrieval. The main impetus behind the growth of data mining was the need to synthesize large amounts of data into knowledge. Despite the importance and proliferation of geospatial data, most research in data mining has focused on transactional or documentary data. 1
From a white paper, "Data Mining Technologies for Geospatial Applications," prepared for the committee's workshop by Dimitrios Gunopulos.
This chapter explores the current state of research and key future challenges in geospatial databases, algorithms, and geospatial data mining. Advances in these areas could have a great effect on how geospatial data are accessed and mined to facilitate knowledge discovery.
TECHNOLOGIES AND TRENDS
This section outlines key developments in database management systems and data mining technologies as they relate to geospatial data.
Database Management Systems
The ubiquity and longevity of the relational database architecture are due largely to its solid theoretical foundation, the declarative nature of the query processing language, and its ability to truly separate the structure of the data from the software applications that manipulate them. With the relational model it is possible for applications to manipulate data (query, update, add new information, and so forth) independent of the database implementation. This abstraction of the database to a conceptual model is the hallmark of all modern database technologies. By separating the application logic from the database implementation, the model makes it possible to accommodate changes (for example, in the physical organization of the data) without disturbing the application software or the users' logical view of the data. This separation also means that efforts made to optimize performance or ensure robust recovery will immediately benefit all applications.
Over the past two decades, the relational model has been extended to support the notion of persistent software objects, which couple data structures to sets of software procedures referred to as methods. Many commercial applications rely on simple data types (e.g., integers, real numbers, date/time, and character strings) and do not require the functionality provided by software objects and their methods. 2 Geodata, however, typically require powerful software objects to implement the rich behavior demanded by geospatial applications. Typical geospatial operations include "length," "area," "overlap," "within," "contains," and "intersects." Geographic information systems (GIS) have employed relational database management systems for years and more recently have begun to use the object-relational database management system (DBMS). 3 However, exchanging data between systems is difficult because of the lack of accepted standards, 4 the multitude of proprietary formats, and the multitude of data models used in geospatial applications.
The scope of software operations that can be performed on a data element is restricted by the type of data. Simple arithmetic operations such as add, subtract, multiply, and divide can be performed on integers (such as 5, 10, and 225) but cannot be performed on character strings such as "National Academy of Sciences." Conversely, operations that can be performed on character strings (e.g., convert a string of characters to uppercase letters or search for a sequence of characters) cannot be performed on integers. The database management system is aware of which operations are supported for each data type; thus, the system permits the multiplication of two integers to form a third but issues an error when an attempt is made to multiply two strings. For the simple data types (integers, real numbers, strings, etc.), the suite of operations for each data type is well known and implemented by virtually all database and programming systems.
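The idea of a software object that couples a data structure to its methods can be illustrated with a minimal Python sketch. The `Polygon` class and its method names below are hypothetical, chosen only to mirror the "area," "length," and "contains" operations named above; they are not taken from any particular GIS or DBMS, and the sketch assumes a simple (non-self-intersecting) polygon in planar coordinates.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Polygon:
    """Hypothetical geospatial type: a simple polygon given as an ordered
    ring of vertices (the first vertex is not repeated at the end)."""
    vertices: Tuple[Point, ...]

    def _edges(self):
        # Yield consecutive vertex pairs, closing the ring at the end.
        n = len(self.vertices)
        for i in range(n):
            yield self.vertices[i], self.vertices[(i + 1) % n]

    def area(self) -> float:
        # Shoelace formula for the area of a simple planar polygon.
        total = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in self._edges())
        return abs(total) / 2.0

    def length(self) -> float:
        # Perimeter: sum of the Euclidean lengths of all edges.
        return sum(math.dist(a, b) for a, b in self._edges())

    def contains(self, p: Point) -> bool:
        # Ray casting: count edge crossings of a ray going right from p.
        # Points lying exactly on the boundary are not handled reliably.
        x, y = p
        inside = False
        for (x1, y1), (x2, y2) in self._edges():
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside


if __name__ == "__main__":
    plot = Polygon(((0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)))
    print(plot.area(), plot.length(), plot.contains((1.0, 1.0)))
```

The same type-checking behavior described for simple data types applies here: `area` is defined for a polygon but has no meaning for an integer or a string, so the operations travel with the type.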
Geospatial Data Mining Tasks
The goal of data mining 5 is to reveal some type of interesting structure in the target data. This might be a pattern that represents some type of regularity or deviation from randomness, such as the daily or yearly temperature cycle at a given location. Data mining may be structured using a top-down or bottom-up approach. Generally, a top-down approach is used to test a hypothesis; the most challenging aspect is the development of a good model that can be used to validate the premise. For example, patterns can be described in some form of statistical model that is fitted to the data, such as a fractal dimension for a self-similar data set, a regression model for a time series, a hidden Markov model, or a belief network. A bottom-up approach, on the other hand, searches the data for frequently occurring patterns or behaviors or, conversely, anomalous or infrequent patterns. Most of the examples of geospatial applications described in this report tend to follow a bottom-up approach of exploratory analysis 6 (and visualization) of results from computational models.
The Open GIS Consortium (OGC), whose members are leading geospatial vendors, users, and consultants, has published a standard describing the data types and their methods that should be implemented within an object-relational database system to support geospatial applications (OGC Simple Features for SQL).
For example, the Office of Management and Budget (OMB) recently announced a revision to Circular No. A-16 (which describes the responsibilities of federal agencies with respect to coordination of surveying, mapping, and related spatial data activities) to standardize geospatial data collected by the government. OMB argues that the lack of standard definitions of terms (e.g., scientists may differ on the distinction between a brook and a creek) has become a barrier to sharing data among organizations. Features such as boundaries, hydrography, and elevation will be included in the list of standard terms (Bhambhani, 2002). For more information on Circular No. A-16, see <http://www.whitehouse.gov/omb/circulars/a016/a016.html>.
The committee notes that because there are no generally accepted standards for data mining terminology, other papers and books may use different terms for the concepts expressed in this report.
Geospatial data mining is a subfield of data mining concerned with the discovery of patterns in geospatial databases. Applying traditional data mining techniques to geospatial data can result in patterns that are biased or that do not fit the data well. 7 Chawla et al. highlight three reasons that geospatial data pose new challenges to data mining tasks: "First, classical data mining…deals with numbers and categories. In contrast, spatial data is more complex and includes extended objects such as points, lines, and polygons. Second, classical data mining works with explicit inputs, whereas spatial predicates (e.g., overlap) are often implicit. Third, classical data mining treats each input to be independent of other inputs whereas spatial patterns often exhibit continuity and high auto-correlation among nearby features." 8 Chawla et al. suggest that data mining tasks be extended to deal with the unique characteristics intrinsic to geospatial data.
There are many different data mining tasks and many ways to categorize them. A thorough survey 9 of geospatial data mining tasks is beyond the scope of this report; instead, the committee chose to highlight four of the most common data mining tasks: clustering, classification, association rules, and outlier detection.
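The third point, autocorrelation among nearby features, can be made concrete with a standard statistic. The sketch below computes Moran's I, a widely used measure of spatial autocorrelation; it is offered only as an illustration and is not a method the report itself prescribes. The helper `line_weights` and the four-cell example are assumptions made for the demonstration: cells are arranged in a row, and each cell is a neighbor of the cells immediately beside it.

```python
from typing import Dict, List, Tuple

Weights = Dict[Tuple[int, int], float]


def line_weights(n: int) -> Weights:
    """Hypothetical helper: symmetric adjacency weights for n cells in a row."""
    weights: Weights = {}
    for i in range(n - 1):
        weights[(i, i + 1)] = 1.0
        weights[(i + 1, i)] = 1.0
    return weights


def morans_i(values: List[float], weights: Weights) -> float:
    """Moran's I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where m is the mean of the values and W is the sum of all weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0.0:
        raise ValueError("Moran's I is undefined when all values are equal")
    w_total = sum(weights.values())
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    return (n / w_total) * (num / denom)


if __name__ == "__main__":
    # Values that change gradually across neighbors versus values that alternate.
    print(morans_i([1.0, 2.0, 3.0, 4.0], line_weights(4)))
    print(morans_i([1.0, 4.0, 1.0, 4.0], line_weights(4)))
```

A positive value indicates that neighboring cells tend to carry similar values, which is the situation in which treating inputs as independent is most misleading; a negative value indicates that neighbors tend to differ.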
"Clustering" attempts to identify natural clusters in a data set. It does this by partitioning the entities in the data such that each partition consists of entities that are close (or similar), according to some distance (similarity) function based on entity attributes. Conversely, entities in different partitions are relatively far apart (dissimilar). Because the objective is to discern structure in the data, the results of a clustering are then examined by a domain expert to see if the groups suggest something. For example, crop production data from an agricultural region may be clustered according to various combinations of factors, including soil type, cumulative rainfall, average low temperature, solar radiation, availability of irrigation, strain of seed used, and type of fertilizer applied. Interpretation by a domain expert is needed to determine whether a discerned pattern (such as a propensity for high yields to be associated with heavy applications of fertilizer) is meaningful, because other factors may actually be responsible (e.g., if the fertilizer is water soluble and rainfall has been heavy). Many clustering algorithms that work well on traditional data deteriorate when executed on geospatial data (which often are characterized by a high number of attributes or dimensions), resulting in increased running times or poor-quality clusters. 10 For this reason, recent research has centered on the development of clustering methods for large, high-dimensional data sets, particularly techniques that execute in linear time as a function of input size or that require only one or two passes through the data. Recently developed spatial clustering methods that seem particularly appropriate for geospatial data include partitioning, hierarchical, density-based, grid-based, and cluster-based analysis. 11
There are also important issues on how to make decisions using the collected and mined geospatial data. Although this topic (called "confirmatory" analysis in statistics) is very important, the committee focused on "exploratory" analysis of data mining for two reasons. First, geospatial data mining has many unsolved problems, which lie in the intersection of geospatial data and information technology. Second, this area was a key concern for the workshop participants.
From Han et al., "Spatial Clustering Methods in Data Mining," in Miller and Han (2001).
From Chawla et al., "Modelling Dependencies for Geospatial Data," in Miller and Han (2001).
John F. Roddick, Kathleen Hornsby, and Myra Spiliopoulou maintain an online bibliography of temporal, spatial, and spatiotemporal data mining research at <http://kdm.first.flinders.edu.au/IDM/STDMBib.html>.
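As one illustration of the density-based family mentioned above, the following is a minimal sketch of the well-known DBSCAN algorithm. It is a teaching example under stated assumptions: the neighbor search is brute force, so the running time is quadratic rather than the linear time the text describes as a research goal, and the parameter names `eps` and `min_pts` and the sample points are chosen only for the demonstration.

```python
import math
from typing import List, Optional, Sequence, Tuple

NOISE = -1  # label for points that belong to no cluster


def region_query(points: Sequence[Tuple[float, ...]], idx: int, eps: float) -> List[int]:
    # Brute-force search for all points within eps of points[idx] (includes idx).
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points: Sequence[Tuple[float, ...]], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels: List[Optional[int]] = [None] * len(points)  # None means unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return [label if label is not None else NOISE for label in labels]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

Because `math.dist` accepts points of any dimension, the same sketch applies to attribute vectors (for example, rainfall and temperature values), although in practice attributes on different scales would need to be normalized first and a spatial index would replace the brute-force search.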
Whereas clustering is based on analysis of similarities and differences among entities, "classification" constructs a model based on inferences drawn from data on available entities and uses it to make predictions about other entities. For example, suppose the goal is to classify forest plots in terms of their propensity for landslides. Given historical data on the locations of past slides and the corresponding environmental attributes (ground cover, weather conditions, proximity to roads and streams, land use, etc.), a classification algorithm can be applied to predict which existing plots are at high risk or whether a planned series of new plots will be at risk under certain future conditions. Various classification methods have been developed in machine learning, statistics, databases, and neural networks; one of the most successful is decision trees. Spatial classification algorithms determine membership based on the attribute values of each spatial object as well as spatial dependency on its neighbors. 12
"Association rules" attempt to find correlations (actually, frequent co-occurrences) among data. For example, the association rules method could detect a correlation of the form "forested areas that have broadleaf hardwoods and occurrences of standing water also have mosquitoes." Spatial association rules include spatial predicates (such as topological, distance, or directional relations) in the antecedent or consequent (Miller and Han, 2001). Several new directions have been proposed, including extensions for quantitative rules, extensions for temporal event mining, testing the statistical significance of rules, and deriving minimal rules (Han and Kamber, 2000).
From Han et al., "Spatial Clustering Methods in Data Mining," in Miller and Han (2001).
From Ester et al., "Algorithms and Applications for Spatial Data Mining," in Miller and Han (2001).
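The strength of a rule such as the mosquito example is usually summarized by its support and confidence. The sketch below computes both for a small, invented set of forested areas; the item names (including the spatial predicate `near_standing_water`) and the five records are hypothetical and exist only to show the arithmetic. It assumes that spatial predicates have already been evaluated and stored as ordinary items.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def support(records: List[Itemset], itemset: Itemset) -> float:
    # Fraction of records that contain every item in the itemset.
    hits = sum(1 for r in records if itemset <= r)
    return hits / len(records)


def confidence(records: List[Itemset], antecedent: Itemset, consequent: Itemset) -> float:
    # Of the records matching the antecedent, the fraction that also
    # match the consequent.
    sup_a = support(records, antecedent)
    if sup_a == 0.0:
        return 0.0
    return support(records, antecedent | consequent) / sup_a


# Invented observations, one set of items per forested area.
areas: List[Itemset] = [
    frozenset({"broadleaf", "near_standing_water", "mosquitoes"}),
    frozenset({"broadleaf", "near_standing_water", "mosquitoes"}),
    frozenset({"broadleaf", "near_standing_water"}),
    frozenset({"conifer", "near_standing_water", "mosquitoes"}),
    frozenset({"broadleaf"}),
]

if __name__ == "__main__":
    rule_if = frozenset({"broadleaf", "near_standing_water"})
    rule_then = frozenset({"mosquitoes"})
    print(support(areas, rule_if | rule_then))    # 2 of 5 areas
    print(confidence(areas, rule_if, rule_then))  # 2 of the 3 matching areas
```

A rule-mining algorithm would enumerate candidate itemsets and keep only the rules whose support and confidence exceed chosen thresholds; this sketch shows only the scoring step.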
"Outlier detection" involves identifying data items that are atypical or unusual. Ng suggests that the distance-based outlier analysis method could be applied to spatiotemporal trajectories to identify abnormal movement patterns through a geographic space. 13 Representing geospatial data for use in outlier analysis remains a difficult problem.
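A common formulation of distance-based outliers flags an object when at least a fraction p of the other objects lie farther than a distance D from it. The sketch below applies that idea to plain points; it is a simplified variant for illustration, and the function name, parameters, and sample data are assumptions. Applying it to trajectories would first require choosing a representation and a distance function for whole trajectories, which is the difficult problem noted above.

```python
import math
from typing import List, Sequence, Tuple


def distance_based_outliers(points: Sequence[Tuple[float, ...]],
                            p: float, d: float) -> List[int]:
    """Return indices of points for which at least a fraction p of the
    other points lie at a distance greater than d (brute force, O(n^2))."""
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        far = sum(1 for j, b in enumerate(points)
                  if j != i and math.dist(a, b) > d)
        if far >= p * (n - 1):
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    observations = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(distance_based_outliers(observations, p=0.9, d=3.0))
```

The choice of p and d is left to the analyst, which reflects the trial-and-error character of exploratory data mining.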
Typically, two or more data mining tasks are combined to explore the characteristics of data and identify meaningful patterns. A key challenge is that, as Thuraisingham (1999) argues, "Data mining is still more or less an art." It is impossible to say with certainty that a particular technology will always be effective in obtaining a given outcome, or that certain sequences of tasks are most likely to yield results given certain data characteristics. Consequently, high levels of experience and expertise are required to apply data mining effectively, and the process is largely trial and error. Research to establish firm methodologies for when and how to perform data mining will be needed before this new technology can become mainstream for geospatial applications. The development of geospatial-specific data mining tasks and technologies will be increasingly important to help people analyze and interpret the vast amount of geospatial data being captured.
This chapter is concerned with how geospatial data can be stored, managed, and mined to support geospatial-temporal applications in general and data mining in particular. A first set of research topics stems from the nature of spatiotemporal databases. Although there has been some research on both spatial and temporal databases, relatively little research has addressed the more complex issues associated with spatiotemporal characteristics. In addition, research investments are needed in geometric algorithms to manipulate efficiently the massive amounts of geospatial data being generated and stored. Despite advances in data mining methods over the past decade, considerable work remains to be done to improve the discovery of structure (in the form of rules, patterns, regularities, or models) in geospatial databases.
From Raymond T. Ng, "Detecting Outliers from Large Datasets," in Miller and Han (2001).
Geospatial databases are an important enabling technology for the types of applications introduced earlier. However, relational DBMSs are not appropriate for storing and manipulating geospatial data because of the complex structure of geometric information and the intricate topological relationships among sets of spatially related objects (Grumbach, Rigaux, and Segoufin, 1998). For example, the restriction in relational DBMSs to the use of standard alphanumeric data types forces a geospatial data object (such as a cloud) to be decomposed into simple components that must be distributed over several rows. This complicates the formulation and efficiency of queries on such complex objects. Also, geospatial data often span a region in continuous space and time, but computers can only store and manipulate finite, discrete approximations, which can cause inconsistencies and erroneous conclusions. A particularly difficult problem for geospatial data is representing both spatial and temporal features of objects that move and evolve continuously over time. To model geographic space, an ontology of geospatial objects must be developed. The final key problem is integrating geospatial data from heterogeneous sources into one coherent data set.
Moving and Evolving Objects
Objects in the real world move and evolve over time. Examples include hurricanes, pollution clouds, pods of migrating whales, and the extent and rate of shrinking of the Amazon rain forest. Objects may evolve continuously or at discrete instants. Their movement may be along a route or in a two- or three-dimensional continuum. Objects with spatial extent may split or merge (e.g., two separate forest fires may merge into one). Existing technologies for database management systems (data models, query languages, indexing, and query processing strategies) must be modified explicitly to accommodate objects that move and change shape over time (see Box 3.1). Such extensions should adhere to the recognized advantages of databases (high-level query mechanisms, data independence, optimized processing algorithms, concurrency control, and recovery mechanisms) and to the kinds of emerging applications used as examples in this report.
Although many different geospatial data models have been proposed, no commonly accepted comprehensive model exists. 14 One key approach
For information on other spatiotemporal model approaches, see Güting et al. (2000).
Box 3.1 The Complexity of Spatiotemporal Data
Despite significant advances in data modeling, much geospatial information still cannot be fully represented digitally. Most of the space-time data models proposed in the past decade rely on the time-stamping of data objects or values, the same way that time is handled in nonspatial databases. Only in recent years has it been recognized that space and time should not always be seen as two orthogonal dimensions. Many researchers advocate a different approach for modeling geographic reality, using events and processes to integrate space and time. Representing events and processes is not a trivial task, however, even at the conceptual level. Complexity arises because scale in space and time affects entity identification.
Depending on the scale of observation, events and processes can be identified as individual entities or as an aggregate. For example, a thunderstorm front can be seen as one event or as multiple convective storms whose number, geometry, location, and existence may change over time. Whereas events and processes operate at certain spatial and temporal scales, their behaviors are somewhat controlled by events and processes operating at larger scales. Similarly, their behaviors not only affect other events and processes at their scale but also somewhat control those operating at smaller scales. Associations among events and processes at different scales must be represented so they can be fully expressed. This means that in addition to retrieving objects, events, and processes, a geodatabase must support calculations that will reveal and summarize their embedded spatiotemporal characteristics.
Another important representational issue in spatial analysis is the effect when data are aggregated over spatial zones. The heterogeneity of microdata patterns within a zone interacts with the zonal boundaries and size, making it difficult to determine what actually has been analyzed. Further, analysis and interpretation should consider larger-scale geographic entities that are related to the zone of interest, not just the microdata within the zone. Data structures are needed that can provide linkages among related data at different scales and enable the dynamic subdivision of zonal data.
Geospatial objects need to be structured accordingly in semantic, spatial, and temporal hierarchies. Semantically related geospatial entities (e.g., census tracts, neighborhoods, and towns) will then be easily associated in space and time, so their properties can be cross-examined at multiple scales. This approach will be increasingly important as spatial analysis is automated in response to the growing volume of spatiotemporal data.
SOURCE: Adapted from a white paper, "Research Challenges and Opportunities on Geospatial Representation and Data Structure," prepared for the committee's workshop by May Yuan.
is to extend traditional relational databases with geospatial data structures, types, relations, and operations. Several commercial systems are now available with spatial and/or temporal extensions; however, they are not comprehensive (i.e., they still require application-specific extensions), nor do they accurately model both spatial and temporal features of objects that move continuously. Most of the research in moving-object databases has concentrated on modeling the locations of moving objects as points (instead of regions). This is the approach used in many industrial applications, such as fleet management and the automatic location of vehicles. Wolfson notes that the point-location management method has several drawbacks, the most important being that it does not enable interpolation or extrapolation. 15 Researchers are beginning to explore new data models. For example, Wolfson has proposed a new model, outlined in Box 2.1 (Chapter 2), that captures the essential aspects of the moving-object location as a four-dimensional linear function (two-dimensional space × time × uncertainty) and a set of operators for accessing databases of trajectories. Uncertainty is unavoidable because the exact position of a moving and evolving object is, at best, only accurate at the exact moment of update; between updates, the object's location must be estimated based on previous behavior. Further, it is problematic to determine how often and under what conditions an object's representation in the database should be changed to reflect its changing real-world attributes. 16 As mentioned in Box 2.1, frequent location updates would ensure greater accuracy in the location of the object but consume more scarce resources such as bandwidth and processing power.
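The trade-off between accuracy and update cost can be sketched with a simple dead-reckoning scheme. This is not Wolfson's model; it is a simplified illustration under stated assumptions: the database stores the last reported position and velocity, estimates later positions by linear extrapolation, and the object sends a new update only when its actual position deviates from the estimate by more than a tolerance. The class name `MotionRecord`, its fields, and the sample numbers are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MotionRecord:
    """Hypothetical database record for one moving object."""
    t0: float         # time of the last update
    x0: float         # position at time t0
    y0: float
    vx: float         # velocity reported at time t0
    vy: float
    tolerance: float  # largest deviation accepted before a new update is due

    def estimate(self, t: float) -> Tuple[float, float]:
        # Linear extrapolation from the last update; exact only at t == t0.
        dt = t - self.t0
        return (self.x0 + self.vx * dt, self.y0 + self.vy * dt)


def needs_update(record: MotionRecord, t: float, actual: Tuple[float, float]) -> bool:
    # The object compares its true position with what the database would
    # estimate and reports only when the error exceeds the tolerance.
    ex, ey = record.estimate(t)
    return math.hypot(actual[0] - ex, actual[1] - ey) > record.tolerance


if __name__ == "__main__":
    truck = MotionRecord(t0=0.0, x0=0.0, y0=0.0, vx=2.0, vy=1.0, tolerance=1.0)
    print(truck.estimate(5.0))
    print(needs_update(truck, 5.0, (10.5, 5.0)))
    print(needs_update(truck, 5.0, (12.0, 5.0)))
```

A smaller tolerance keeps the stored location closer to reality but triggers more updates, which is the bandwidth and processing cost described above; the tolerance also serves as a bound on the uncertainty of any position the database reports between updates.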
Güting and his colleagues have proposed an abstract model for implementing a spatiotemporal DBMS extension. They argue that their framework has several unique aspects, including a comprehensive model of geospatial data types (beyond just topological relationships) formulated at the abstract infinite point-set level, a process that deals systematically and coherently with continuous functions as values of attribute data types, and an emphasis on genericity, closure, and consistency (Güting et al., 2000). They suggest that more research is needed to extend their model from moving objects in two-dimensional (2D) space to moving volumes and their projections into space (Güting et al., 2000). A second approach is based on the constraint paradigm. DEDALE, one example of a constraint database system for geospatial data proposed by the Chorochronos Participants
From a white paper, "The Opportunities and Challenges of Location Information Management," prepared for the committee's workshop by Ouri Wolfson.