Analysis and Data Improvement Tools

NAACCR Committees and members have worked collaboratively to develop tools and resources for use by central cancer registry analysts and researchers. Select one of the options below to learn more.

NAACCR Central Registry Analyst Handbook

Overview

While each registry is different, most analysts perform the same essential functions and, in turn, come across similar issues. The materials here provide information and guidance to new and established analysts at central cancer registries, who both utilize population-based cancer surveillance data and release it to other researchers for analysis. The goal is to provide a comprehensive set of resources to support all aspects of data use and research in central cancer registries.

This page will present a range of general guidance, from using the NAACCR data dictionary to navigating some of the more complex variables that have changed over time. Additional resources under development include information on obtaining population data, creating cancer surveillance statistics, geospatial analysis, and how to handle different data requests, from aggregate statistics to community cancer concerns. This page will also include tips and resources for running data queries and using some of the common registry software. While these resources will cover many of the functions an analyst performs, they are not an exhaustive list. Analysts should always follow the established protocols and best practices for their registry, where applicable.

Guidance, best practices, and other resources here are maintained by the RDU Research Analyst Handbook Taskforce (RAHT) in coordination with other RDU subcommittees, Central Registry Operations Standards (CROS), and the NAACCR Executive Office. Resources will be published as they are made available. We encourage analysts to reach out to RDU RAHT with suggestions for additional topics, as well as to share resources they may have developed at their registry.

Subsections:

Role of the Central Registry Analyst

NAACCR Data Dictionary

Data Quality from the Analyst Perspective

Generating Cancer Surveillance Statistics

How to Obtain National Cancer Statistics

Complex Variables

Obtaining Population Data

Research Data Requests

Running Data Queries

Data Analysis for Cancer Control Programs

Epidemiology

Geospatial Analysis

Confidentiality, Data Security, and Data Transmission Best Practices

Performing Data Linkages

Call for Data

Additional Resources

The VENUSCANCER Study is embedded in the CONCORD programme, a world-wide initiative designed to explain the global inequalities in patterns of care, short-term survival and trends in avoidable premature deaths from breast, cervical and ovarian cancers, the three most common cancers in women. This project aims to provide levers for health policy to reduce or eliminate avoidable differences in survival from these cancers. The SAS code posted here maps or translates NAACCR V22 to the VENUSCANCER data dictionary.

VENUSCANCER SAS Code for V22 – Updated June 2023

March 2023 Webinar

Population data by race/ethnicity for congressional districts can be ascertained by aggregating census blocks, but these data are only available as part of the decennial census. Intercensal tract-level population estimates by race/ethnicity are available, but census tracts do not translate directly to congressional districts because some census tracts are split across two or more congressional districts.

An analysis was conducted to compare the populations of actual congressional districts with those of tract-estimated congressional districts, in order to determine the potential effect of calculating cancer rates by congressional district using tract-estimated congressional district populations. For the analysis, each tract was assigned to only one congressional district.

  • When a tract was split across multiple congressional districts, it was assigned to the congressional district containing the piece of the tract with the most people.
  • When the pieces of a split tract had equal populations, the tract was assigned to the lowest numbered congressional district.
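
The two assignment rules above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name, the tuple layout, and the tract and district identifiers are hypothetical, and the input is assumed to be block-level populations already tagged with their tract and congressional district.

```python
from collections import defaultdict


def assign_tracts_to_districts(pieces):
    """Assign each tract to exactly one congressional district.

    pieces: iterable of (tract_id, district_number, population) tuples,
    for example one tuple per census block (hypothetical layout).
    Returns a dict mapping tract_id to district_number.
    """
    # Sum the population of each tract piece that falls in each district.
    totals = defaultdict(int)
    for tract_id, district, population in pieces:
        totals[(tract_id, district)] += population

    best = {}  # tract_id -> (district, population of that piece)
    for (tract_id, district), population in totals.items():
        current = best.get(tract_id)
        # Larger piece wins; on a tie the lowest numbered district wins.
        if current is None or (population, -district) > (current[1], -current[0]):
            best[tract_id] = (district, population)
    return {tract_id: district for tract_id, (district, _) in best.items()}


# Made-up example: T1 is split unevenly, T2 is split evenly, T3 is not split.
example = [
    ("T1", 3, 1200), ("T1", 5, 800),
    ("T2", 7, 500), ("T2", 4, 500),
    ("T3", 1, 900),
]
assignment = assign_tracts_to_districts(example)
```

In this made-up example T1 goes to district 3 (the larger piece) and T2 goes to district 4 (the tie is broken by the lower district number).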

Block-level populations were aggregated to produce counts for both the actual congressional districts and the tract-estimated congressional districts by:

  • Sex: male, female
  • Race/ethnicity groups: non-Hispanic white, non-Hispanic black, non-Hispanic American Indian or Alaska Native, non-Hispanic Asian Pacific Islander, non-Hispanic Other, Hispanic
  • Age groups: 0-49, 50-64, 65+

To compare the actual congressional district populations with the tract-estimated congressional district populations, we converted the counts of each population subgroup to percentages and calculated the absolute value of the differences between the two percentages. We also summarized the percent difference in population for all the congressional districts in each state and for the US as a whole.
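
As a rough illustration of this comparison, the sketch below converts subgroup counts to percentages and takes the absolute difference between the two sets of percentages. It assumes each percentage is the subgroup's share of the district total, which the description above does not state explicitly; the function names and the counts are made up.

```python
def subgroup_percentages(counts):
    """Convert subgroup counts to percentages of the district total."""
    total = sum(counts.values())
    if total == 0:
        return {group: 0.0 for group in counts}
    return {group: 100.0 * n / total for group, n in counts.items()}


def absolute_percent_differences(actual_counts, estimated_counts):
    """Absolute difference, in percentage points, for each subgroup."""
    actual = subgroup_percentages(actual_counts)
    estimated = subgroup_percentages(estimated_counts)
    return {
        group: abs(actual[group] - estimated.get(group, 0.0))
        for group in actual
    }


# Made-up counts for one district, by sex.
actual = {"male": 490, "female": 510}
tract_estimated = {"male": 480, "female": 520}
differences = absolute_percent_differences(actual, tract_estimated)
```

The same function can be applied to the race/ethnicity and age groupings, and the resulting differences can then be summarized by state or for the US as a whole.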

After comparing differences in counts in the overall populations and populations by sex, race/ethnicity, and age groups, the tract-estimated congressional districts were found to have small differences compared to the actual congressional district population counts. With small differences in populations and subgroups of interest, these tract-estimated congressional district populations could be used to calculate congressional district cancer rates that would likely be close to the rates if the actual congressional district populations were available. Using these tract-estimated congressional districts to calculate cancer rates has the added advantage of aligning well with cancer case geocoding that is usually developed and verified at the census tract level.

We are currently supporting the 117th Congressional Districts using 2010 tract boundaries and the 118th Congressional Districts using both 2010 and 2020 tract boundaries.

Epidemiologists cannot ignore the impact of social conditions on population health. Cancer registries currently collect the Krieger poverty codes, an area-based social measure (ABSM) derived from census data on the percent of people living below poverty. These codes are available in US cancer registry data at the county and census tract level and can be used to assess the impact of poverty at the individual level, using the poverty ABSM as a proxy, and at the community level, addressing both the compositional and contextual effects of social environment on cancer.

The Krieger codes have been the standard, but the codes were developed using New England census data. For other regions of the country with higher poverty rates, and for population groups with higher poverty rates, the Krieger cutpoints result in residual confounding, and using these cutpoints can mask real disparities, particularly when analyzing minority populations. Additionally, other social data, such as language isolation or housing security, may also be important to include in etiologic and public health planning research. Instead of relying on just the single poverty measure of SES, researchers have developed a multifactorial socioeconomic index to evaluate the potential impact of socioeconomic gradients on cancer burden. This index, called the Yost Index because it was developed by Kathleen Yost, requires a number of area-based social measures that are available from the census.

With the above in mind, NAACCR is requesting supplemental, tract-level ABSMs to be submitted during call for data for evaluation. The variables are time dependent, based on diagnosis year, and the data are appended to the case using NAACCR*Prep. The additional variables requested are now minimal in number because we are linking to a predefined SES index, the Yost variable, instead of all the component parts of the index. We are also collecting SES quintiles by race. The Yost variable has been assessed and recoded to limit uniqueness of the data and ensure limited risk of disclosure of census tract. The SES index is identical to the SEER composite SES score. More information is available here: https://seer.cancer.gov/seerstat/databases/census-tract/index.html.

These data are still being evaluated by NAACCR. None of these variables are released for any research without the consent of the submitting registry. The ACS Quintiles by race/ethnicity will be used to develop useful, race/ethnicity-based cutpoints to enable research on poverty by minority groups. During this evaluation period, NO supplemental data will be released to outside researchers.

Central registries have two ways to calculate these variables. After the variables are appended, using the combination of geocoded state, county, and tract, the tract can then be stripped from the data. This is an option in NAACCR*Prep. Alternatively, a registry may choose to maintain the tract for 2010 and 2020 and submit it to NAACCR for evaluation for CiNA Geographic. The tract data will be separated and stored separately from the other CiNA data. Access to tract data is currently limited to the NAACCR Program Manager of Data Use and Research, Dr. Recinda Sherman, and required IMS database administrators. Use is currently limited to evaluation and fitness-for-use assessment only, not research.

If you have any questions, please contact Recinda Sherman.

NCI, in collaboration with NAACCR, is working with individual registries to develop a set of cancer reporting zones across the U.S. that are more suitable for cancer data reporting than counties. In each respective state, the zones will be custom crafted to represent areas that:

  • are meaningful to stakeholders in terms of cancer reporting and cancer interventions;
  • comprise adjacent census tracts and smaller counties (or portions of counties) that sum to population sizes that are sufficiently large to support stable rates;
  • collectively cover the entire population of the state;
  • are homogeneous with respect to important sociodemographic characteristics and are compact in size;
  • have large enough case counts for data reporting without compromising confidentiality; and
  • result in a relatively small proportion of areas with suppressed values, although for rarer cancer sites suppression will be inevitable, especially when producing rates stratified by sex and/or race.

These zones subdivide large population urban counties and are collections of smaller rural counties (or portions of counties). They have a minimum population size of 50,000. Participation in the project is voluntary. If your registry is interested in participating, please contact Recinda Sherman. To learn more about the project and methodology, see the article Developing Geographic Areas for Cancer Reporting Using Automated Zone Design.

As of July 2024, NCI has developed and finalized zones associated with 22 registries and their catchment areas. A crosswalk of 2010 census tract to cancer reporting zone is available for these areas and can be used to calculate cancer rates. The crosswalk includes an 11-character tractID based on 2010 census tract geographies, an 8-character ZoneIDOrig variable that identifies the unique cancer reporting zone within a registry’s catchment area, and a 10-character ZoneIDFull variable that identifies the unique cancer reporting zone across registries (the 2-digit state FIPS code followed by the 8-character ZoneIDOrig). The crosswalk also includes a Zone_Tract_Certainty variable, a flag that indicates whether the census tract is needed to assign the zone. The field has two codes: 0 = can assign zone based on county; 1 = need tract to determine zone. This flag can be used in conjunction with census tract certainty: when Zone_Tract_Certainty = 1, only high-certainty census tracts should be used to assign the cancer reporting zone and subsequently calculate statistics at the zone level.
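
The relationship between the crosswalk fields can be illustrated with a short Python sketch. The function names, the boolean "high certainty" input, and the example zone identifier are hypothetical; how a registry decides that a geocoded tract is high certainty is outside the scope of this sketch.

```python
def zone_id_full(state_fips, zone_id_orig):
    """Build the 10-character ZoneIDFull from its two documented parts."""
    if len(state_fips) != 2 or not state_fips.isdigit():
        raise ValueError("state FIPS code must be 2 digits")
    if len(zone_id_orig) != 8:
        raise ValueError("ZoneIDOrig must be 8 characters")
    return state_fips + zone_id_orig


def can_assign_zone(zone_tract_certainty, tract_is_high_certainty):
    """Decide whether a case can be assigned to a cancer reporting zone.

    zone_tract_certainty: 0 = zone can be assigned from county alone,
                          1 = census tract is needed to determine the zone.
    tract_is_high_certainty: the registry's own judgment of the geocode
                             (hypothetical boolean input).
    """
    if zone_tract_certainty == 0:
        return True
    if zone_tract_certainty == 1:
        return bool(tract_is_high_certainty)
    raise ValueError("Zone_Tract_Certainty must be 0 or 1")


# "ZONE0001" is a made-up placeholder, not a real ZoneIDOrig value.
full_id = zone_id_full("16", "ZONE0001")
```

Cases for which can_assign_zone returns False would be left out of zone-level statistics rather than assigned to a zone on a low-certainty tract.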

This algorithm combines NHIA and NAPIIA into a single SAS program.

This program is used with a NAACCR standard data exchange file format with confidential information, including a census tract identifier. The program will link the census tract identifier with the percent of the residents in the census tract who live below the poverty level. This information is based on data from the 2000 U.S. Census and the American Community Survey. The data used are the census data most closely aligned with diagnosis year. The program will output two variables that will be attached to every input registry record: the percent poverty for the census tract (xx.x%), and a second variable that groups the exact percentages into four categories: less than 5% poverty, 5%-9.9% poverty, 10%-19.9% poverty, and 20% or higher poverty.
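
The grouping step can be sketched as follows. This is not the SAS program itself: the function name and the "unknown" label are hypothetical, and rounding to one decimal place before grouping is an assumption made to match the xx.x% output format.

```python
def poverty_category(percent_below_poverty):
    """Group a tract's percent below poverty into the four categories."""
    if percent_below_poverty is None:
        return "unknown"  # hypothetical label for tracts with no linked value
    # Assumption: group on the value as reported to one decimal place.
    pct = round(percent_below_poverty, 1)
    if pct < 5:
        return "less than 5% poverty"
    if pct < 10:
        return "5%-9.9% poverty"
    if pct < 20:
        return "10%-19.9% poverty"
    return "20% or higher poverty"


category = poverty_category(12.3)
```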

Rural-Urban Data Items

Studies have shown that residents of rural areas have lower screening rates, lower rates of follow-up of abnormal screening tests, higher late-stage diagnosis rates, and differences in cancer treatment patterns. Including tract-level indicators of rural-urban residence in the NAACCR data files will facilitate research in rural-urban disparities and allow researchers to control for rural-urban differences in model-based analysis of cancer risks and outcomes.

This SAS code creates two different measures of the rural-urban environment. The URIC is a measure of the rural nature of the place of residence and can be an indicator of access to recreation, access to food stores, exposures to pollutants, crime levels, social cohesion, etc. The USDA RUCA-based indicator is a measure of the proximity to large urban centers and can be an indicator of access to oncology specialists and cancer treatment facilities. Both indicators have been tested for uniqueness and they do not allow the identification of individual census tracts as long as the county is not known.

Description of items:

Two indicators of the rural-urban environment based on the census tract of the diagnosis address:

  • Urban Rural Indicator Codes (URIC) is based on the Census Bureau’s identification of urban and rural areas
  • Rural Urban Commuting Areas Codes (RUCA) is based on the USDA’s Rural Urban Commuting Area (RUCA) codes

Cases diagnosed between 1995 and 2004 are assigned a code based on the 2000 U.S. Census. Cases diagnosed since 2005 are assigned a code based on the 2010 U.S. Census. 

Allowable values:

  • URIC:
    • 1: all urban – the percent of the population in an urban area = 100%
    • 2: mostly urban – the percent of the population in an urban area < 100% and ≥ 50%
    • 3: mostly rural – the percent of the population in a rural area < 100% and > 50%
    • 4: all rural – the percent of the population in a rural area = 100%
    • 9: unknown or not applicable – census tract not available or tract population was zero at the last decadal census
    • C: the state + county + tract combination was not found in the lookup table
    • D: the state, county, or tract was blank or had an unknown value (e.g., state was “ZZ”, county was “999”, etc.)

  • RUCA:
    • 1: urban commuting area – RUCA codes 1.0, 1.1, 2.0, 2.1, 3.0, 4.1, 5.1, 7.1, 8.1, and 10.1
    • 2: not an urban commuting area – all other RUCA codes except 99
    • 9: unknown or not applicable – census tract not available or RUCA code = 99
    • C: the state + county + tract combination was not found in the lookup table
    • D: the state, county, or tract was blank or had an unknown value (e.g., state was “ZZ”, county was “999”, etc.)
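
The allowable values above can be expressed as a small recoding sketch in Python. It is not the SAS code: the function names are hypothetical, RUCA codes are assumed to arrive as strings such as "4.1", and the lookup-related codes C and D are left out because they depend on the linkage step rather than on the tract's characteristics.

```python
# Detailed RUCA codes treated as "urban commuting area" in the list above.
URBAN_COMMUTING_RUCA = {
    "1.0", "1.1", "2.0", "2.1", "3.0", "4.1", "5.1", "7.1", "8.1", "10.1",
}


def census_vintage(diagnosis_year):
    """Census used for coding: 2000 for 1995-2004, 2010 for 2005 onward."""
    if 1995 <= diagnosis_year <= 2004:
        return 2000
    if diagnosis_year >= 2005:
        return 2010
    return None  # earlier diagnosis years are not covered by the description


def uric_code(percent_urban):
    """URIC from the tract's percent urban (None if tract missing or empty)."""
    if percent_urban is None:
        return "9"
    if percent_urban == 100:
        return "1"  # all urban
    if percent_urban >= 50:
        return "2"  # mostly urban
    if percent_urban > 0:
        return "3"  # mostly rural (more than 50% rural, less than 100%)
    return "4"      # all rural


def ruca_indicator(ruca_code):
    """Two-level RUCA-based indicator from a detailed RUCA code string."""
    if ruca_code is None or ruca_code == "99":
        return "9"
    return "1" if ruca_code in URBAN_COMMUTING_RUCA else "2"
```

Keeping RUCA codes as strings avoids floating-point comparisons between values such as 10.1 and 10.10.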

Supplemental Area Based Social Measures (ABSM)

Epidemiologists cannot ignore the impact of social conditions on population health. Cancer registries currently collect the Krieger poverty codes, an area-based social measure (ABSM) derived from census data on the percent of people living below poverty. These codes are available in US cancer registry data at the county and census tract level and can be used to assess the impact of poverty at the individual level, using the poverty ABSM as a proxy, and at the community level, addressing both the compositional and contextual effects of social environment on cancer. The Krieger codes have been the standard, but the codes were developed using New England census data. For other regions of the country with higher poverty rates, and for population groups with higher poverty rates, the Krieger cutpoints result in residual confounding, and using these cutpoints can mask real disparities, particularly when analyzing minority populations. Additionally, other social data, such as language isolation or housing security, may also be important to include in etiologic and public health planning research. Instead of relying on just the single poverty measure of SES, using these additional fields gives researchers the flexibility to use different cutpoints for poverty for research on minorities and to incorporate additional SES contextual variables into analysis as needed. Researchers can also create multifactorial socioeconomic indices, such as the Yost Index, to evaluate the potential impact of socioeconomic gradients on cancer burden.

The list of variables is available in a separate Excel spreadsheet here under ‘Supplemental Information’. This SAS code currently pulls the variables from two different time periods and appends the data based on census tract. The tract is then stripped from the data.

If you have any questions, please contact Recinda Sherman.

Registries can use this worksheet for a reasonable assessment of current case ascertainment. Please keep in mind that adjustments were made to the method for diagnosis year 2020 and forward. However, using the most recent diagnosis year spreadsheet will apply the most current population estimates, so your completeness monitoring will be closer to the official NAACCR estimate.

The Record Uniqueness Program was developed by Howe, Lake, and Shen to assess electronic data files for risk of confidentiality breach based on unique combinations of key variables.
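
The general idea of checking for unique combinations of key variables can be sketched as below. This is not the Record Uniqueness Program or its algorithm, only a minimal illustration of the concept; the function name, the field names, and the records are hypothetical.

```python
from collections import Counter


def unique_record_share(records, key_variables):
    """Share of records whose combination of key variables is unique.

    records: list of dicts, one per record.
    key_variables: names of the variables to combine.
    """
    if not records:
        return 0.0
    combos = Counter(
        tuple(record.get(name) for name in key_variables) for record in records
    )
    # A combination that occurs exactly once belongs to exactly one record.
    unique_records = sum(1 for count in combos.values() if count == 1)
    return unique_records / len(records)


# Made-up records: the first two share a combination, the last two do not.
records = [
    {"county": "001", "sex": "1", "age_group": "65+"},
    {"county": "001", "sex": "1", "age_group": "65+"},
    {"county": "001", "sex": "2", "age_group": "0-49"},
    {"county": "003", "sex": "1", "age_group": "50-64"},
]
share = unique_record_share(records, ["county", "sex", "age_group"])
```

A higher share of unique records suggests a higher risk that an individual could be identified from the released variables.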

This is a software utility developed in MS Access to identify miscoded sex codes based on first name. Taking as input a data file in NAACCR v16 format, a query runs against a list of known sex/name pairs, and it produces a list of cases for manual review that have potential errors in sex. The utility is based on an algorithm initially created by the New York Cancer Registry in August 2011.

In real-world registry settings, the number of potential errors flagged by the tool is extremely low, in the neighborhood of 0.25%. After careful review, users have reported that about 20-50% of the cases identified by the tool as needing review are indeed in error. Higher percentages have been found when the tool is run on incoming registry data. For cases where the edit flagged a sex that was correct, a misspelling of the name was often identified. For male breast cancer, nearly all flagged cases were errors, a consequence of the highly skewed sex distribution of this cancer site. A published study on this tool is available here.
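
The kind of query the utility runs can be illustrated with a short Python sketch. This is not the MS Access tool or the New York algorithm: the function name, the record field names, and the tiny name list are hypothetical. Sex is assumed to use the NAACCR codes 1 (male) and 2 (female).

```python
def flag_sex_name_mismatches(cases, known_pairs):
    """Return cases whose coded sex disagrees with a known name/sex pair.

    cases: list of dicts with hypothetical keys "first_name" and "sex".
    known_pairs: dict mapping an upper-case first name to its expected sex code.
    """
    flagged = []
    for case in cases:
        name = (case.get("first_name") or "").strip().upper()
        expected = known_pairs.get(name)
        # Names not on the list are not flagged; only known pairs are checked.
        if expected is not None and case.get("sex") != expected:
            flagged.append(case)
    return flagged


known_pairs = {"MARY": "2", "JOHN": "1"}  # made-up reference list
cases = [
    {"first_name": "Mary", "sex": "1"},   # flagged for manual review
    {"first_name": "John", "sex": "1"},   # consistent
    {"first_name": "Alex", "sex": "2"},   # name not on the list
]
for_review = flag_sex_name_mismatches(cases, known_pairs)
```

As noted above, a flagged case is only a candidate for manual review, since the mismatch may reflect a misspelled name rather than a miscoded sex.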

A list of tools which can import and export data in NAACCR Volume II format.

V24 SAS TRANSLATION TOOL (AUGUST 2024)

Here is a SAS program and Word instructions for reading and writing NAACCR V24 XML files using SAS. We are grateful to Fabian Depry, IMS, for updating these resources annually and to Chris Johnson for spearheading this resource.
The SAS program is to be used in conjunction with the Word document. Note: Use of this code does require proficiency in SAS.

Before starting, read the instructions first.
Really.
It will make your life better.

As you use the tool, we appreciate any feedback or comments you have (contact Chris Johnson of the Idaho registry)

V23 SAS TRANSLATION TOOL (AUGUST 2023)

Here is a SAS program and Word instructions for reading and writing NAACCR V23 XML files using SAS. We are grateful to Fabian Depry, IMS, for updating these resources annually.

The SAS program is to be used in conjunction with the Word document. Read the instructions first. Really. It will make your life better. Note: Use of this code does require proficiency in SAS.

As you use the tool, we appreciate any feedback or comments you have (contact Recinda Sherman)

V22 SAS TRANSLATION TOOL (September 2022)

The SAS program and accompanying Word document are to assist the NAACCR community in reading and writing NAACCR XML V22 files using SAS.

The SAS program is to be used in conjunction with the Word document, “Instructions for ReadWrite_NAACCR_22_XML_tidy.sas_20220926.docx.” Read the instructions first. Really.

The SAS program harnesses SAS code and macros written by Fabian Depry, IMS, adds SAS labels, and removes fields from the SAS datasets that are 100% missing.

The code template below can be used by proficient SAS programmers to efficiently and accurately access data in the NAACCR XML V22 format. Code to both read and write NAACCR XML V22 format is provided. Various sections and options are included; users simply comment out sections that are not applicable to their specific needs. The SAS code supports the three most often used record types: Incidence, Confidential, and Abstract (which includes text fields).

As you use the tool, we appreciate any feedback or comments you have (contact Recinda Sherman)

V21 SAS TRANSLATION TOOL (October 2021)

This version of the V21 SAS translation tool is designed to work with naaccr-xml-utility-8.6, which was posted to https://github.com/imsweb/naaccr-xml/releases on October 12, 2021. This update supersedes the September 2021 update which made it easier to harness the SAS translation tools.

Changes since 8.4 include:

  • Upgraded all base dictionaries to specifications v1.5; added new dateLastModified attribute.
  • Added a new optional ‘cleanupcsv’ parameter (defaults to true) to allow the temporary CSV to not be automatically deleted.
  • Improved feedback messages the macros write to the logs.
  • Improved help written in the macros.

The three latter changes were added mainly to improve QC and/or debugging processes.

The SAS program and accompanying Word document are to assist the NAACCR community in reading and writing NAACCR XML V21 files using SAS.

The SAS program is to be used in conjunction with the Word document, “Instructions for ReadWrite_NAACCR_21_XML_tidy.sas_20210928.docx.” Read the instructions first. Really.

The SAS program harnesses SAS code and macros written by Fabian Depry, IMS, adds SAS labels, and removes fields from the SAS datasets that are 100% missing.

The code template below can be used by proficient SAS programmers to efficiently and accurately access data in the NAACCR XML V21 format. Code to both read and write NAACCR XML V21 format is provided. Various sections and options are included; users simply comment out sections that are not applicable to their specific needs. The SAS code supports the three most often used record types: Incidence, Confidential, and Abstract (which includes text fields).

As you use the tool, we appreciate any feedback or comments you have. Contact [email protected] with your thoughts.

v18 SAS Translation Tool

The code template below can be used by proficient SAS programmers to efficiently and accurately access data in the V18 format. Code to both read and write ASCII V18 format is provided. Various sections and options are included – users simply comment out sections which are not applicable for their specific needs. The code supports the three most often used record types (Incidence, Confidential and Text).

New for V18, there are two versions of the SAS code: one that uses NAACCR item numbers for the SAS variable names (as has been done in the past), and one that uses NAACCR XML names (NAACCR ID) for the SAS variable names. As NAACCR makes the transition from the “flat” ASCII file to XML, we encourage you to utilize the SAS code with NAACCR XML names. It will help prepare you for the future! As you use the tool, we appreciate any feedback or comments you have. Contact [email protected] with your thoughts.

To find more XML tools developed by NAACCR members, visit https://www.naaccr.org/xml-data-exchange-standard/.

Note: Translation tools for V14, V15, and V16 include code that handles data elements which are part of the CDC’s Comparative Effectiveness Research (CER) and Patient-Centered Outcomes Research (PCOR) projects.

V16 & V15 MS ACCESS TRANSLATIONAL TOOL

This MS Access database contains an import/export file specification for NAACCR v15 and v16 record layouts. It allows the user to import these types of files, perform operations on them, and then export them back out as a text file in the same format. Contact [email protected] if you have any feedback on this tool.

Along with incidence and mortality data, information on population-based cancer survival is necessary to understand the full burden of cancer in our society. This SAS code is used to create the variables needed to conduct relative survival analyses for CiNA Volume 4: CiNA Survival. It is made available here for use by researchers on their own data and is currently updated for a study cutoff date of 2015.

Copyright © 2018 NAACCR, Inc. All Rights Reserved