TANGO-DocLab web tables from international statistical sites (Troy_200)
web tables, table segmentation, table headers
Statistics Canada, http://www.statcan.gc.ca/start-debut-eng.html
The World Bank, http://www.worldbank.org/
Statistics Norway, https://www.google.com/?gws_rd=ssl#q=Statistics+Norway
Statistics Finland, http://www.stat.fi/index_en.html
US Department of Justice, https://www.justice.gov/
US Energy Information Administration, https://www.eia.gov/
US Census Bureau, http://www.census.gov/
DATA COLLECTION: About 1000 tables were collected from international statistical websites by DocLab graduate students in 2009-2010. A perceptually random subset of 200 of these tables was converted (mostly from HTML) via Excel into CSV files and stored at DocLab. The original HTML files were not retained, and their URLs were recorded only haphazardly. Many of these tables can still be found by a web search, but others have been modified, corrected, or updated on the original sites. The dataset now consists only of the 200 CSV files that represent the 200 web tables.
GROUND TRUTH: The four critical cells (top-left and bottom-right of the stub-header and top-left and bottom-right of the data region) were entered with an interactive tool (VeriClick) in 2011. Several ground-truth (GT) errors were subsequently found during segmentation experiments and corrected. Researchers may still disagree, for 2-3 tables, on whether a particular row or column (with unusual contents or units) belongs to the data region. The GT for all 200 tables is a single CSV file, which makes it easy to add entries and to verify the results of a header segmentation experiment.
Each table is a separate CSV file. Some of the CSV files have invisible blank columns to the right of the table and invisible blank rows below it. There are no blank columns to the left or blank rows above.
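The trailing blank rows and columns described above can be removed before further processing. The following Python sketch is only an illustration and not part of the dataset tools; the function name trim_blank_edges and the sample table are hypothetical, and a cell is assumed to be blank if it is empty or contains only whitespace.

```python
import csv
import io


def trim_blank_edges(rows):
    """Drop all-blank trailing rows and all-blank trailing columns.

    Assumption: a cell counts as blank if it is empty or whitespace only.
    """
    rows = [list(r) for r in rows]
    # Remove trailing rows in which every cell is blank.
    while rows and all(not cell.strip() for cell in rows[-1]):
        rows.pop()
    # The table width is set by the right-most non-blank cell in any row.
    width = 0
    for r in rows:
        for i, cell in enumerate(r):
            if cell.strip():
                width = max(width, i + 1)
    # Pad short rows and cut every row to the common width.
    return [(r + [""] * width)[:width] for r in rows]


# Hypothetical table with two blank columns on the right and two blank rows below.
sample = "Year,Value,,\n2009,10,,\n2010,12,,\n,,,\n,,,\n"
table = trim_blank_edges(csv.reader(io.StringIO(sample)))
print(table)
```

Blank rows or columns inside the table are left untouched, so cell addresses in the ground truth remain valid after trimming.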
The GT_Troy... CSV file has 200 rows and 5 columns. The first column is the file name. The next four columns are the critical cells in Excel A1 notation. The second and third cells of each row span the rows of the minimal indexing column headers and the columns of the minimal indexing row headers. The last two cells span the data region. There may be Notes rows between the minimal indexing column header and the data region, and Notes or Footnotes below the data region.
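As an illustration of the GT format described above, the following Python sketch converts A1 references to zero-based indices and cuts the data region out of a table. It assumes the column order given above (file name, header top-left, header bottom-right, data top-left, data bottom-right); the function names, the file name example.csv, and the sample table are hypothetical.

```python
import re


def a1_to_index(ref):
    """Convert an Excel A1 reference such as 'B3' to zero-based (row, col)."""
    m = re.fullmatch(r"([A-Za-z]+)([0-9]+)", ref.strip())
    if m is None:
        raise ValueError(f"not an A1 reference: {ref!r}")
    col = 0
    for ch in m.group(1).upper():
        # Column letters are a base-26 number without a zero digit.
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(m.group(2)) - 1, col - 1


def parse_gt_row(row):
    """Split one GT row into the file name and two rectangular regions."""
    name, hdr_tl, hdr_br, data_tl, data_br = row[:5]
    return {
        "file": name,
        "header": (a1_to_index(hdr_tl), a1_to_index(hdr_br)),
        "data": (a1_to_index(data_tl), a1_to_index(data_br)),
    }


def data_region(table, gt):
    """Return the cells of the data region as a list of rows."""
    (r0, c0), (r1, c1) = gt["data"]
    return [row[c0:c1 + 1] for row in table[r0:r1 + 1]]


# Hypothetical GT row and matching table, for illustration only.
gt = parse_gt_row(["example.csv", "A1", "A1", "B2", "C3"])
table = [
    ["Year", "Males", "Females"],
    ["2009", "10", "11"],
    ["2010", "12", "13"],
]
print(data_region(table, gt))
```

Comparing the four indices produced by a segmentation program with those in the GT row is then a simple equality test per table.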
The metadata also includes GT_SAUS_200, in the same format, for 200 pseudo-randomly selected tables from the 1321 spreadsheets of the Statistical Abstracts of the United States (SAUS) that were collected and posted by Professor Michael Cafarella at:
This GT can be used if the corresponding files are downloaded from the above website. These files are formatted Excel spreadsheets, not CSV files, but there is usually a 1:1 correspondence in cell addresses.
Some results of our experiments on both the Troy and SAUS tables, and references to our previous publications on tables, can be found in "Converting Heterogeneous Statistical Tables on the Web to Searchable Databases", David W. Embley, Mukkai Krishnamoorthy, George Nagy, Sharad Seth, Int'l J. Document Analysis and Recognition (IJDAR), Springer, posted online February 11, 2016.