Address Matching and Geocoding
Many data sources have attributes that are inherently spatial in nature, such
as addresses or x,y coordinates. We think of addresses and coordinate pairs as
spatial data; however, they are not. They describe where something is, but
without geometry they are useless for spatial analysis. GeoMedia contains
three commands that allow you to convert text descriptions of locations,
addresses, intersections, or coordinate pairs or triplets, into actual
map locations. Address matching is the process of calculating the latitude
and longitude of a location from a street address. Geocoding takes addresses
or x,y coordinates as input, calculates their latitude and longitude, and
creates point geometry representing locations, which it stores as a query
recordset. GeoMedia contains two commands that process addresses: Find
Address for interactive address matching and Geocode Addresses for batch
geocoding. A third command, Geocode Coordinates, converts coordinates to point data.
This chapter will discuss working with addresses first, followed by a discussion
of coordinate geocoding.
Working with Addresses
GeoMedia provides two commands for working with addresses and intersections,
Find Address and Geocode Addresses. Find Address is an interactive navigation
tool that finds the location of a single address or intersection and displays
it as a point in a map window. The Geocode Addresses dialog is used to
create point geometries for a feature class or a query based on input address
or intersection attributes contained in the feature class. Its output is
a query that can be displayed in a map window or data window. It is handled
like any other query. Find Address and Geocode Addresses have some features
in common. They both find address locations by using the information contained
in an Address Coding Guide. Both commands use the same rulesets that define
match criteria. These are referred to as address-match strategies.
The Address Coding Guide
The Find Address and Geocode Addresses commands use a Geographic Data Technology, Inc. (GDT) Address Coding Guide (ACG). An ACG consists of a set of files that provides the city, state, and ZIP Code data needed to locate addresses. It also contains street segment information that includes the street name and the starting and ending house numbers for both the odd and even numbered sides of the street. The ACG also contains information designed to compensate for the fact that addresses, address abbreviations, locality names, and so on, have many variations. The GeoMedia-compatible ACGs contain address data for the 50 United States and Puerto Rico in several versions, designed to provide you with different levels of detail and accuracy for your specific address-matching workflows. You can obtain complete ACGs from your Intergraph Sales Representative or Intergraph Business Partner.
GeoMedia is delivered with a sample ACG for Madison County, Alabama.
It is installed by default in the MadisonCountyAL ACG subdirectory
in your Warehouses directory. This set of files allows you to explore the
capabilities of the Find Address and Geocode Addresses commands.
Address-Match Strategies
Both Find Address and Geocode Addresses allow you to tailor the tolerance
of your address search by using one of the three address-match strategies:
Aggressive, Normal (the default), or Conservative. The following table
lists the conditions of each of these address-match strategies:
Address-Match Strategies
| Address-Match Strategy Conditions | Aggressive | Normal | Conservative |
| ZIP Code is used to match the address if the address fails to match with the city name and state. | X | X | X |
| Spell correction is more lenient (e.g., the first character can be corrected). | X | ||
| Spell correction is moderate. | X | ||
| Spell correction is strict. | X | ||
| Input address will match only to segments whose first letter matches the input street name within the locality. | X | ||
| Input address will match within 200 addresses to a segment's address range. | X | ||
| Input address will match within 100 addresses to a segment's address range. | X | ||
| Input address will match to the other side of a street segment. For example, if the input house number is odd and there is no odd address range, it will match to the even side of the segment. | X | X | |
| Input address will match even though it has pre- and/or post-directionals and the street segment does not, and vice-versa. | X | X | |
| Input address will match to a street segment with different pre- and/or post-directionals. | X | ||
| Input address will match when the pre- and post-directionals are transposed. For example, the input address of N 18th St West will match to W 18th St North. | X | X | |
| Input address will match to a street segment with a different ZIP Code. | X | X | |
| Input house number must be within a street segment address range. | X | ||
| If the input house number is odd, then the associated street segment must have an odd address range, and vice versa. | X | ||
| Pre- and post-directionals must match. | X | ||
| ZIP Code must match. | X |
Note: Spell correction generally fixes errors such as one-letter differences, extra spaces, missing spaces, and transposed characters.
The Conservative strategy follows the strictest set of rules. At the other extreme, the Aggressive strategy uses the most flexible rule set for finding addresses. In other words, it is able to find more addresses at the expense of accuracy. The differences among the strategies are illustrated in the following examples.
In the first example, you want to find the location of the house address 420 James St., but the street database only contains segments for James St. with the ranges of 2-98, 100-198, and 200-298 for the given locality. The Find Address command would not find a match for this address using the Conservative or Normal address-match strategies. With the Aggressive strategy, it would match it to the 200-298 segment because the house number 420 is within 200 of the address range on that segment. The software would place the geocoded point on the high end of the segment, at the same point where the address 298 James St. would be placed.
In a second example, you want to find the house address 320 James
St. in the same street database. Unlike the previous example, the Normal
address-match strategy would match it to the 200-298 segment because 320 is
within 100 of the address range on that segment, as would Aggressive. The
location of this address would not be found using the Conservative strategy.
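The two James St. examples above can be reproduced with a minimal sketch of range-tolerance matching. This is an illustration only, not GeoMedia's implementation: the names RANGE_TOLERANCE, distance_to_range, and match_segment are hypothetical, the tolerances (0, 100, 200) are read from the strategy table, and odd/even parity, spelling, and directionals are ignored.

```python
from typing import Optional, Sequence, Tuple

Segment = Tuple[int, int]  # (low house number, high house number)

# Assumed tolerances: how far outside a segment's address range a house
# number may fall and still match, per the strategy table.
RANGE_TOLERANCE = {"Conservative": 0, "Normal": 100, "Aggressive": 200}


def distance_to_range(house_number: int, segment: Segment) -> int:
    """Return 0 if the number lies inside the range, else the gap to the nearest end."""
    low, high = segment
    if house_number < low:
        return low - house_number
    if house_number > high:
        return house_number - high
    return 0


def match_segment(house_number: int, segments: Sequence[Segment],
                  strategy: str) -> Optional[Segment]:
    """Return the nearest segment if it is within the strategy's tolerance."""
    if not segments:
        return None
    nearest = min(segments, key=lambda seg: distance_to_range(house_number, seg))
    if distance_to_range(house_number, nearest) <= RANGE_TOLERANCE[strategy]:
        return nearest
    return None


james_st = [(2, 98), (100, 198), (200, 298)]
for strategy in ("Conservative", "Normal", "Aggressive"):
    print(strategy, match_segment(420, james_st, strategy),
          match_segment(320, james_st, strategy))
```

In this sketch, 420 is 122 numbers past the end of the 200-298 segment, so only the Aggressive tolerance accepts it, while 320 is 22 past the end and is accepted by both Normal and Aggressive.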
Finding Addresses and Intersections
The View > Find Address command allows you to locate an address or intersection in a map window. The command is only available when the map window is active. The first time you access Find Address or Geocode Addresses, the Find Address Options dialog will run. After this dialog has been run once, it is accessible only from the Find Address control. Find Address Options allows you to set the path to your ACG file, select an address-match strategy, and control the display of the geocoded point, including style and offset. The dialog is shown in the following illustration.
Caption: The Find Address Options dialog.
The ACG location panel allows you to enter the file system path to the directory containing your ACG files. The first time you run Find Address this field will be empty. On subsequent runs it will contain a list of all the ACGs you have defined in the course of your workflows, and its value will default to the last ACG that you used. All ACG path information is stored in the system registry in \HKEY_CURRENT_USER\Software\Intergraph\GeoMedia\04.00\PreferenceSet. A new key, named ACGPath0, ACGPath1, and so on, is created for each ACG path you define. Since this information is stored in the registry, it is used by every GeoMedia session you initiate. The directory in ACGPath0 will be the default location for both the Find Address command and the Geocode Addresses command.
The Find Address Options dialog also allows you to select a match strategy and the offset and display properties for the location. The three match strategies, discussed above, allow you to specify the rigorousness of the search. If the offset value is zero, the command constructs the point, given an address match, at the actual location. When the offset is greater than zero, the command applies the offset distance and units, taking into account the parity of the house number, if appropriate. Therefore, odd addresses will appear offset on one side of the street and even addresses on the other. The offset is not applied when locating an intersection. Setting the style allows you to choose the point style for optimum display results. These parameters are specific to the Find Address command and, unlike the ACG location, are not used by the Geocode Addresses dialog.
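The parity-based offset described above can be pictured with a small geometric sketch. It is not how GeoMedia computes the point: the function name offset_point is hypothetical, the coordinates are assumed to be planar and in the same units as the offset, and placing odd numbers on the left of the segment's direction is an arbitrary assumption made only for illustration.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def offset_point(start: Point, end: Point, fraction: float, offset: float,
                 house_number: Optional[int] = None) -> Point:
    """Place a point along a street segment and shift it sideways by parity.

    fraction is the position along the segment (0.0 = start, 1.0 = end).
    house_number=None stands for an intersection, which is never offset.
    Assumption: odd numbers fall on the left of the start-to-end direction,
    even numbers on the right.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    x = start[0] + fraction * dx
    y = start[1] + fraction * dy
    length = math.hypot(dx, dy)
    if offset == 0 or house_number is None or length == 0:
        return (x, y)
    # Unit vector pointing to the left of the direction of travel.
    left_x, left_y = -dy / length, dx / length
    side = 1 if house_number % 2 else -1
    return (x + side * offset * left_x, y + side * offset * left_y)


# A street running due east: odd numbers land north of the centerline,
# even numbers south of it.
print(offset_point((0, 0), (10, 0), 0.5, 2, house_number=101))
print(offset_point((0, 0), (10, 0), 0.5, 2, house_number=100))
```

With an offset of zero, or for an intersection, the point stays on the segment itself, which mirrors the behavior the options dialog describes.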
Once you specify an ACG, Find Address provides a dockable control shown below. It allows you to execute view manipulation and other commands while finding addresses. You can zoom in and out of the view, set map window properties, add feature classes to the legend, and so on. To end functions like zoom or pan, you must use the ESC key on your keyboard. Hitting the Select Tool terminates the Find Address control. The control consists of fields for address, city, and zip code, a drop-down list of state name abbreviations, the Find Address button (the binoculars), and the Options button at the far right.
Caption: The Find Address dockable control.
To find a street address or intersection, you enter the address or intersection, the city name, and the state name. Intersection street names are separated by And, &, At, or @. The match strategy you choose determines the input you must provide. When using the Conservative strategy you must enter information for all attributes. If you choose the Normal strategy you should enter the Street Address, City, and State; ZIP Code is optional. The Aggressive strategy requires the Street Address or intersection attribute with the City and State, or with the ZIP Code.
When the Find Address button is clicked, the ACG is searched for the address. If a match is found, the point geometry is placed in the map window using the style definition and offset parameters. Map window properties dictate whether the window will center on the point at the current scale, fit and zoom, or simply view at the current scale.
There are several additional points to be made about the map window. First, it must be active for Find Address to work. Say, for example, you have a data window open that contains street addresses you wish to copy and paste into the Find Address control. You must reactivate the map window before clicking the Find Address button. Second, you will undoubtedly have at least one feature class displayed in the map window. Remember, the Find Address command does not use your feature data to locate addresses; it uses the information contained in the ACG. This can result in apparent conflicts. Even though you know that your street data, for example, contains a line segment that matches the input address, no point will be placed when the ACG does not contain that segment, when the match strategy is too stringent, or when the search produces an ambiguous match. Finally, Find Address displays a single point when a match is successful. That point is cleared when a second address match is made.
When you have many addresses to locate, or if you need the locations in other parts of your workflow, you should use the Geocode Addresses dialog, discussed below.
Unsuccessful Address Matches
The Find Address command will not find every address you type into the dialog. There are a variety of warnings the dialog will return when a match fails.
Match failed: The street segment for this address is invalid or ambiguous. Any of the fields might be bad including zip code when Conservative matching is selected. This error is also generated when the address can be matched to more than one segment in the ACG.
Match failed: The house number was not given or is an invalid value. The input house number does not fall into any address range in the ACG. You can try the address match using the Aggressive strategy. It will give you the greatest flexibility in matching numbers, allowing a match to occur if the house number is within 200 of a segment's address range.
Match failed: The street name was not given, is an invalid value, or is ambiguous. Most likely there is no street name in the address you entered or the address has multiple matches in the ACG.
Match failed: The locality was not given or is an invalid value. The city or state is probably wrong.
Tip: The Geocode Addresses command provides much more extensive information about the reasons for match failures.
Finding Addresses Workflow 28-1
In this exercise you will set up the Find Address Options and interactively display address locations. Create a new GeoWorkspace and open a connection to MadisonCountyAL.mdb. Add the Street feature class to the legend.
Caption: The Find Address toolbar button.
If the Find Address dockable control comes up, rather than the Find
Address Options dialog, Find Address has been run previously. Click on
the Options icon, far right, to run the Options dialog.
At this point, the Find Address Options dialog is dismissed and the Find Address dockable control opens.
Caption: The Find Address icon (left) and the Options icon (right).
Note: A zip code is not required unless the Conservative address-match strategy is selected. In this example, we don’t know what the zip code is for the address we’re looking for.
The location will be added and the contents of the map window will be refreshed according to the option selected in the Map Window Properties dialog; either view at current scale, center at current scale, or fit and zoom out.
Note: You may need to adjust the offset and style for better viewing.
You may also need to adjust the match strategy to obtain the appropriate
results based on the matrix provided earlier in this section. Click the
Options button on the dockable control to reopen the Find Address Options
dialog box.
Caption: Display of an address location with the Find Address control docked under the Standard toolbar.
Caption: Using the View manipulation tools with the results of an address find.
Caption: Locating an intersection with Find Address.
Find Address returns a result for an intersection or address only if
there is a single match found. If more than one intersection or address
matches with equal reliability, an ambiguous match will be displayed.
Geocoding Addresses
You will often find that you have nonspatial databases or files that contain address information such as customer files, mailing lists, or real estate listings. These files and databases, while inherently spatial, are useless for spatial analysis since they contain no geometry. They can, however, be used to create spatial data. GeoMedia features are generated from address data with the Geocode Addresses command. Geocoding addresses is the process of creating point geometry for address or intersection data by matching addresses with information in an Address Coding Guide (ACG). ACGs were discussed in the previous section of this chapter.
Find Address and Geocode Addresses are fundamentally the same operation. Like the Find Address command, Geocode Addresses generates the latitude and longitude of the input data from an ACG using an address-match strategy, and it creates point geometry for the input address. Unlike the interactive Find Address command, Geocode Addresses is essentially a ‘batch’ operation that takes a feature class containing addresses as input and creates a query that contains geometry for locations and additional attribute information. The idea of using a feature class or a query as input has certain implications for your geocoding workflows. The fact that geocoding creates a query recordset means that geocoded output can be managed with any command associated with queries in GeoMedia.
Geocoding Feature Classes and Queries
While your GeoMedia feature classes might contain addresses, this information often resides in external data sources such as spreadsheets, text files, or dBase, FoxPro, or Paradox databases. In these cases, you can either connect to the source data with the ODBC Tabular data server or you can use the Attach Table command in the Warehouse > Feature Class Definition dialog to create a read-only feature class. The ODBC Tabular data server is discussed in Chapter 6. The Attach Table command is covered in Chapter 26.
To serve as input to the Geocode Addresses command, a feature class must contain fields that contain a street address, a city name, and a state name. If you plan to use the Conservative match strategy (discussed above) then your attribute data must also include a column containing zip codes.
A query can also serve as input to the Geocode Addresses command. Suppose you need to geocode only a subset of a feature class; for example, you need address locations for one county but the warehouse contains statewide data. First run an attribute query to select only the records for that county, and then geocode the query recordset rather than the entire feature class. Alternatively, you could geocode the entire feature class, then extract the records of interest by querying the Geocode Addresses query.
The Geocode Addresses Dialog
Geocode Addresses creates point geometry for a feature class or query based on input records that contain attributes that can be identified as street address, city, state, and zip code. A feature class or query and the Geographic Data Technology, Inc. (GDT) Address Coding Guide (ACG) are used to create a query set containing the longitude and latitude that correspond to the input address. The geocoded address points can be displayed in the map window, and the nongraphic attributes of the geocoded points can be displayed in a data window.
The Geocode Addresses command is found on the Analysis menu on the Main Menu bar, or can be accessed from the Geocode Addresses toolbar button, shown below. This command is available regardless of the active window type.
Caption: The Geocode Addresses toolbar button.
The dialog, shown in the illustration below, collects information about the feature class, the ACG, the attributes to be geocoded, the query, and output display.
Caption: The Geocode Addresses dialog.
You select a feature class or query to geocode from the "Geocode Addresses in" drop-down list. The browse button or the drop-down list in the ACG panel allows you to identify the Address Coding Guide to be used. The field will always contain the last ACG used by this command or the Find Address command. The ten most recently used ACG locations are available in the drop-down list.
Next you must select the feature class attributes that represent the address data itself. Each field in the "Address attributes" panel contains a drop-down list of all the text attributes in the selected feature class. All attributes are listed alphabetically. If the Street Address data consists of intersection information, the two street names must be separated by And, At, &, or @ (case insensitive). The Street Address attribute can contain a mix of both intersection and street address type data.
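Because a single Street Address attribute can mix addresses and intersections, it can help to see how the separators might be recognized. The following is a hedged sketch, not the parser GeoMedia uses: parse_street_field is a hypothetical name, and the rule that And and At must be surrounded by whitespace (so that a name like Anderson is not split) is an assumption.

```python
import re
from typing import Tuple

# "And" and "At" must stand alone as words; "&" and "@" may touch the names.
_SEPARATOR = re.compile(r"\s+(?:and|at)\s+|\s*[&@]\s*", re.IGNORECASE)


def parse_street_field(value: str) -> Tuple[str, ...]:
    """Classify a Street Address value as an intersection or a plain address."""
    text = value.strip()
    parts = [part.strip() for part in _SEPARATOR.split(text) if part.strip()]
    if len(parts) == 2:
        return ("intersection", parts[0], parts[1])
    return ("address", text)


print(parse_street_field("Main St AND Oak Ave"))
print(parse_street_field("N 18th St@Pine Rd"))
print(parse_street_field("101 Anderson Rd"))
```

Values that do not split into exactly two street names are treated as ordinary addresses in this sketch; a real parser would need additional rules for street names that themselves contain a separator word.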
Note: The match strategy selected will determine which input fields will be used to find address locations. However, the right-hand side of the dialog will not become available until you have supplied a column name for each of the four fields. You may use the same attribute for more than one of these entries. Let’s look at an example. Say your address data does not include zip codes; perhaps it came from a telephone directory. You have to use the Normal or Aggressive match strategy, since neither uses the zip code if city and state names are available. You still have to enter an attribute name in the Zip Code field, but it can be any text field in the database since it won’t be used. In general, if the Conservative strategy is selected, all attributes must exist in the database. When using Normal you should have the Street Address/Intersection, City, and State data, and optionally ZIP Code attributes. The Aggressive strategy can find locations based on Street Address or Intersection attributes with City and State, or Street Address information with only the ZIP Code attribute.
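The input requirements in the note above can be summarized as a small pre-check you might run on your own records before geocoding. This is a sketch under stated assumptions: has_required_fields is a hypothetical helper, blank strings stand for missing values, and the rules are a literal reading of the note rather than the command's actual validation.

```python
def has_required_fields(strategy: str, street: str = "", city: str = "",
                        state: str = "", zip_code: str = "") -> bool:
    """Check whether a record carries the inputs a match strategy expects."""
    street, city, state, zip_code = (
        value.strip() for value in (street, city, state, zip_code)
    )
    if not street:
        return False  # every strategy needs a street address or intersection
    if strategy == "Conservative":
        return bool(city and state and zip_code)
    if strategy == "Normal":
        return bool(city and state)  # ZIP Code is optional
    if strategy == "Aggressive":
        return bool((city and state) or zip_code)
    raise ValueError(f"unknown match strategy: {strategy!r}")


# A telephone-directory style record with no ZIP Code.
record = {"street": "420 James St", "city": "Huntsville", "state": "AL"}
for strategy in ("Conservative", "Normal", "Aggressive"):
    print(strategy, has_required_fields(strategy, **record))
```

A record without a ZIP Code fails the Conservative check but passes the other two, which is why the telephone-directory example calls for the Normal or Aggressive strategy.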
The Advanced button, which is optional, takes you to the Advanced Properties
dialog box, shown below. This dialog allows you to change your address-match
strategy and the point display offset and units. The "Optional output attributes"
list contains attributes that are part of the ACG. It provides you with
the opportunity to add attribute information to the query recordset from
the data stored in the ACG. The list of available fields is dependent upon
the ACG selected.
Caption: The Geocode Addresses Advanced Properties dialog.
When the Address attributes panel is completed, the balance of the dialog box becomes available. The output query is named Geocoded Addresses of FeatureClassName by default. The Description field is optional. You can elect to display the query results in an existing map and/or data window, and you can modify the point geometry displays using the Style button.
Note: Geocode Addresses returns a result for an intersection or address
only if there is a single match found. If more than one intersection or
address matches with equal reliability, an ambiguous match will be displayed.
Reliability is discussed below.
Geocoding Addresses Workflow 28-2
In this workflow you will geocode the addresses of some health clubs around the Huntsville, Alabama area. The addresses were derived from an Internet Yellow Pages web site, then converted to a comma-delimited text file using Microsoft Excel. The file was then attached to the MadCo warehouse and, for ease of distribution, output to a feature class. In your workflows the feature class creation step is not necessary. Create a new workspace and open a connection to MadCo.mdb. Add the Street and CountyBoundary features to the legend to serve as a reference map.
A partial listing of the feature class contents is shown in the illustration below. Notice that the table contains only the attributes Club, Address, City, and State.
Caption: Layout of the Health_csv address information.
As you saw above, this table does not contain zip codes; however,
you cannot proceed unless you enter something in this field. The values
contained in whichever column you have selected here will be ignored, as
you are going to use the Normal address-match strategy.
Caption: The completed Geocode Addresses dialog.
Caption: The Geocode Addresses query results.
Geocoding Queries
Now let’s look at the attribute information that was returned by the query you created in the last workflow. There are a number of new attributes that Geocode Addresses adds to the query recordset. There is a set of attributes that describe the geocoding itself, and the set of optional output attributes you requested from the ACG from the Geocode Addresses Advanced Properties dialog.
Geocode Addresses Columns
The Geocode Addresses query will always contain the fields in the input feature class, including those that store the street address, city, state, and ZIP Code information, and the point geometry for the locations if the match was successful. The query result also contains the following Geocode Addresses-specific columns.
Latitude and Longitude
Latitude and Longitude are shown in degrees. These columns are null, or blank, if the match fails or if it is ambiguous. The datum of the ACG can be found in the General.dat file in the ACG directory.
Status
This column contains output status information concerning the match strategy and how this relates to the location output by the process. This field is limited to 255 characters. As you correct the listed problems, this field is updated. This column has the following three states:
· If the match is successful, the column contains a status message and a match rationale. The match rationale states why the match cost was not zero.
· If the match fails and there was not an error, the column contains a match status message and a statement about what is missing or incorrect.
· If there is an error, the column contains an error status and an error message describing the problem.
The Status and MatchCost columns for the query in the previous workflow are shown below.
MatchCost
For each item in an address that needs to be changed to resolve the address, the software assigns a cost value to the change or to achieving the match. The value in this column is the sum of each address-match cost. The range is 0-999, with 0 (zero) representing a perfect match and 999 representing a match with many changes made to resolve it. If the address cannot be matched, the value is null. The value in this column is a good indicator of how accurate the addresses are and can be used for comparison between results. Notice in the following illustration the costs associated with the various problems Geocode Addresses found in the workflow data.
Caption: Geocode Addresses Status and Cost columns.
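One way to use MatchCost for comparison is to sort geocoded records into those you accept, those you want to review, and those that did not match. The sketch below assumes the query rows have been exported as dictionaries with a MatchCost key (None for null); the function name triage_by_cost and the review threshold of 100 are hypothetical choices, not values defined by GeoMedia.

```python
from typing import Dict, Iterable, List, Optional


def triage_by_cost(records: Iterable[Dict[str, Optional[int]]],
                   review_above: int = 100) -> Dict[str, List[dict]]:
    """Group geocoded rows by MatchCost: 0 is a perfect match, None means no match."""
    groups: Dict[str, List[dict]] = {"accepted": [], "review": [], "unmatched": []}
    for record in records:
        cost = record.get("MatchCost")
        if cost is None:
            groups["unmatched"].append(record)
        elif cost > review_above:
            groups["review"].append(record)
        else:
            groups["accepted"].append(record)
    return groups


# Illustrative rows only; these are not results from the workflow data.
rows = [
    {"Club": "A", "MatchCost": 0},
    {"Club": "B", "MatchCost": 250},
    {"Club": "C", "MatchCost": None},
]
result = triage_by_cost(rows)
print({name: [row["Club"] for row in group] for name, group in result.items()})
```

The threshold is a judgment call that depends on your data; reading the Status column for the rows in the review group tells you which changes the software made to achieve each match.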
ParsedAddress
This column contains the standardized address that the command returns. This is a concatenation of the whole address, separated by blanks and commas. If the address could not be resolved, the column contains all the input values. Compare the input address from the previous workflow with the parsed address created by the Geocode Addresses command. Notice that all characters are converted to uppercase, extraneous information has been stripped, abbreviations have been standardized, and the zip code, which was missing in the workflow data, has been appended.
Caption: Comparison of the input address and the parsed address.
CoordGeocodeStatus
This column contains a null value for successfully geocoded coordinates. It contains an error description for coordinates that are not successfully geocoded.
StreetSide
This column contains the side of the street on which the address is located. This can be one of three values for a valid match.
StreetSide is 0 (zero) when the address has neither a left nor right side. This is the case when the input address given is an intersection of two streets. When the address is on the left side of the street StreetSide will be set to 1. The value of StreetSide is 2 when the address is on the right side of the street.
If the address cannot be matched, the value is null.
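If you process the query results outside GeoMedia, the StreetSide codes listed above can be decoded with a small lookup. This is a minimal sketch; the StreetSide enumeration and the describe_street_side helper are hypothetical names, and None is assumed to stand for a null value.

```python
from enum import IntEnum
from typing import Optional


class StreetSide(IntEnum):
    """StreetSide codes as described in the text."""
    NEITHER = 0  # intersection: neither a left nor a right side
    LEFT = 1
    RIGHT = 2


def describe_street_side(value: Optional[int]) -> str:
    """Translate a StreetSide value into a readable label."""
    if value is None:
        return "unmatched"
    return StreetSide(value).name.lower()


for code in (0, 1, 2, None):
    print(code, describe_street_side(code))
```

Any code outside 0 to 2 raises a ValueError in this sketch, which makes unexpected values easy to spot.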
Geocoding Coordinates
There are numerous data sets consisting of attribute data that includes
location information in the form of coordinate pairs or triplets. This
data might be stored in a text file, a spreadsheet, or a database. Many
of these data sets are available on the Internet from sources such as NOAA.
While these data are inherently spatial in nature they
lack the geometry necessary for spatial analysis. The Geocode Coordinates
command creates point geometries for a feature class or query based on
coordinate values stored in that feature class. It outputs results in the
form of a query, which includes a status indicator field for troubleshooting
bad coordinate data.
Like the Geocode Addresses command discussed in the previous section,
Geocode Coordinates takes a feature class or query as its input. If you
are using external data it must be connected to your workspace with the
ODBC Tabular data server (see Chapter 6), or Attached as a feature class
in the Warehouse > Feature Class Definition dialog (see Chapter 26). Geocode
Coordinates also takes a coordinate system definition as input. The coordinate
system of the source data is defined with a coordinate system file (.csf),
with a design file (.dgn), or interactively in the Geocode Coordinates
dialog.
The Geocode Coordinates command mathematically converts two- or three-dimensional
coordinate data in any supported projection into point geometry. It will
process coordinate values stored as text, integer, long, single, and double
data types, in any coordinate units (for example, degrees or radians) and in
any format (for example, decimal degrees or d:m:s) supported by GeoMedia.
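To make the difference between the d:m:s and decimal degree formats concrete, the following sketch converts one into the other. It is only an illustration of the arithmetic: dms_to_decimal is a hypothetical helper, the colon-separated layout with a leading minus sign for south or west is an assumption, and GeoMedia's own parser may accept other variants.

```python
def dms_to_decimal(text: str) -> float:
    """Convert a 'd:m:s' string such as '-86:35:10.5' to decimal degrees."""
    parts = [part.strip() for part in text.strip().split(":")]
    if not 1 <= len(parts) <= 3 or not parts[0]:
        raise ValueError(f"expected d, d:m, or d:m:s, got {text!r}")
    # Remember the sign separately so that '-0:30:00' stays negative.
    negative = parts[0].startswith("-")
    degrees = abs(float(parts[0]))
    minutes = float(parts[1]) if len(parts) > 1 else 0.0
    seconds = float(parts[2]) if len(parts) > 2 else 0.0
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"minutes and seconds must be in [0, 60): {text!r}")
    value = degrees + minutes / 60 + seconds / 3600
    return -value if negative else value


print(dms_to_decimal("34:30:00"))
print(dms_to_decimal("-86:30:00"))
```

Degrees, minutes, and seconds are combined as degrees + minutes/60 + seconds/3600, with the sign applied last.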
The Geocode Coordinates Dialog
The Geocode Coordinates dialog is found on the Analysis menu on the Main Menu bar. This dialog, shown below, allows you to select a feature class for geocoding, define the source coordinate system, name the output query, and specify output display parameters.
Caption: The Geocode Coordinates dialog.
In the "Geocode attributes in" drop-down list you select the feature class or query containing attributes to be geocoded. In the "Coordinate system of attributes" panel, you have two ways to specify a coordinate system. The Define button takes you to the standard Define Coordinate System dialog, which allows you to review and/or modify the attributes of the default coordinate system. That dialog is discussed in detail in Chapters 6 and 39. You have the option of saving the current coordinate system definition with the Save As button. This command will create a .csf file, which can be reused by Geocode Coordinates or any other object that requires a .csf file. Your second option is to select an existing .csf file with the Browse button. Once you have either defined the coordinate system or located a .csf file, you tell Geocode Coordinates about the storage format of the coordinate data using the "Units and Format" button. The Units and Format dialog, shown below, allows you to specify which format your coordinate values are stored in.
Caption: The Geocode Coordinates Units and Format dialog.
Individual fields are available as a function of the base storage type, Geographic or Projected, of the coordinate system. With the coordinate system parameters established, you move to the "Coordinate attributes" portion of the dialog. The names of the first and second coordinate fields vary dynamically with the selected coordinate system, units, and format. If the base storage type of the coordinate system is Geographic the fields will be labeled Latitude and Longitude, and will be tagged with the units you selected. If the base storage type is Projected, the field will be labeled with the Projection Quadrant and units selected. In either case, you select the name of the database column where the coordinates are stored from the drop-down list. The "Height" field is optional, and defaults to <None>.
The "Query name" defaults to Geocoded Points of FeatureClassName. You have the option of changing it and of adding a Description to the query. Finally, you select the output parameters. You can elect to have the query displayed in new or existing map and/or data windows, and you can predefine map symbology with the Style Definition key.
The geocoded points are generated and displayed in the specified map
window and/or data window. The query contains all the columns in the input
feature class plus the new geometry. A column named CoordGeocodeStatus
is added to the query. If geocoding fails for any record, this field will
contain some diagnostic information about the kind of problem it had with
the coordinate data. The output query is handled as any other query. It
is saved in the GeoWorkspace. It can serve as input to other queries and
it can be output to a feature class for permanent storage in a read/write
warehouse.
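Before running Geocode Coordinates on a large table, you might pre-screen geographic coordinates yourself. The sketch below follows the same convention as CoordGeocodeStatus, returning nothing for a usable pair and a description otherwise, but the function name coord_status and the message wording are invented for illustration and are not the messages the command writes.

```python
from typing import Optional


def coord_status(latitude, longitude) -> Optional[str]:
    """Return None for a usable latitude/longitude pair in degrees, else a diagnostic."""
    try:
        lat = float(latitude)
        lon = float(longitude)
    except (TypeError, ValueError):
        return "coordinate is missing or not numeric"
    if not -90 <= lat <= 90:
        return "latitude out of range"
    if not -180 <= lon <= 180:
        return "longitude out of range"
    return None


# Values may arrive as text or numbers, as in an attached text file.
for pair in [("34.7", "-86.6"), ("134.7", "-86.6"), ("34.7", ""), (None, -86.6)]:
    print(pair, coord_status(*pair))
```

Rows that fail such a check are the ones most likely to show a diagnostic in the CoordGeocodeStatus column of the output query.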
Geocoding Coordinates Workflow 28-3
In this exercise you will geocode earthquake locations. The file of earthquake locations was created from a query against the National Oceanic and Atmospheric Administration’s ‘Significant Earthquakes Database’ at http://www.ngdc.noaa.gov/seg/hazard/earthqk.shtml. It includes attribute information for ‘significant’ earthquakes occurring in the years 1950-2000 for the U.S. The file was reformatted in Excel, output as a comma-delimited text file, Attached to a read/write warehouse, and imported as a feature class. It was imported as a feature class solely for the purpose of distributing the exercise data. It could just as easily be geocoded if it were an Attached table or if it were served up by ODBC Tabular.
Create a new GeoWorkspace and connect to the SigEarthquake.mdb warehouse. Add the PROVINCES and STATES feature classes to the legend.
Caption: The Geocode Coordinates toolbar button.
Notice that the parameters that refer to projected data are unavailable.
Note that the units specified in the previous dialog (deg and km) are displayed with the field names in this portion of the dialog.
Caption: The completed Geocode Coordinates dialog.
Caption: The Geocode Coordinates query in the map window.
You can now use the query result to create a thematic display, for example
by earthquake magnitude; you can execute new queries against the geocode
query; or you can use the Warehouse > Output to Feature Class command to
create a graphic feature class from the geocode query.