But if you work with census data, this may save your day:
https://github.com/greeninfo/CensusShapefileMaker
It does a day's worth of census downloading, joining, trimming, and calculation... in 10 to 15 minutes (most of that being the FTP download).
History
I created this for a client. Her specific use case is to have census blocks for a whole state (though sometimes one county), joined to a few attribute tables so we have statistics such as number of people, number of people who are black, white, etc., number of people who are under 18, and so on. Some of those statistics have to be derived: "number of people under 18" means adding up 8 attributes, then dropping those 8 since we really only wanted their sum. And of course, we only want a dozen or so columns but those attribute tables are huge, so she'd drop hundreds of columns with unintuitive names such as p61ia. She would do this process in ArcMap; it would take an entire day or sometimes two, and it was highly error-prone.
So, I sat down with MCDC's Dexter and the USCB FTP and figured things out.
A few tricks and paths
- Dexter lets you choose exactly which attribute fields to download, and it serves both the decennial census and the annual ACS. This is great, since it makes the CSV files so much simpler to parse: they contain only the fields you asked for.
- I did try having ogr2ogr perform the join between the CSV and the shapefile, but after a couple of hours it was obviously not getting anywhere. So the technique is to open the shapefile via Python ogr and loop over its records, assigning each one its attributes from the CSV. For performance, it loads the whole attribute table into a dict, which sounds bad but in reality works fine since modern laptops have 8 GB of RAM.
- Good ol' ogr2ogr is still used for stripping out attributes, e.g. the stock shapefiles have extra attributes we didn't care for. This could have been done in Python ogr as well, but was already done in ogr2ogr.
- We wanted a YOUTH field indicating number of people under age 18, but the Age By Sex table has that broken over 8 fields. So at the merge phase, the script adds up those 8 fields and stores the sum as YOUTH in lieu of the 8 individual fields.
- Median Household Income (MHHINC) comes from ACS and not from decennial, and comes at the tract level instead of the block level. Fortunately, the hierarchical naming convention for the GEOID makes it dead simple to figure out which tract contains a given block: a block's GEOID is 15 digits (state, county, tract, block), and its first 11 digits are the GEOID of its tract. During the merge phase, those first 11 digits of the block's GEOID are used to look up the corresponding tract. Easy.
- As an afterthought, the thing can also strip down to one specific county (identified by its FIPS code, which is part of the hierarchical GEOID) if you don't want the whole state. This is done at the very end, since frankly the processing is so fast that the seconds you'd save by pruning it first aren't worth the programming trouble.
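To make the merge phase concrete, here is a minimal sketch of the ideas above in plain Python: load the block CSV into a dict, sum the youth fields, find the tract from the first 11 digits of the block GEOID, and prune to one county at the end. It is not the code from the repository. The column names (GEOID, POP, and the eight age fields) and the function names are made up for illustration, and the real script loops over shapefile features through Python ogr rather than over a plain list of GEOIDs.

```python
import csv
import io

# Hypothetical column names: the real Dexter CSV headers will differ.
YOUTH_FIELDS = ["M_U5", "M_5_9", "M_10_14", "M_15_17",
                "F_U5", "F_5_9", "F_10_14", "F_15_17"]


def load_block_table(csv_text):
    """Read the block-level CSV into a dict keyed by GEOID.

    The eight under-18 fields are summed into a single YOUTH value,
    so the individual fields never reach the output.
    """
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[row["GEOID"]] = {
            "POP": int(row["POP"]),
            "YOUTH": sum(int(row[f]) for f in YOUTH_FIELDS),
        }
    return table


def tract_of(block_geoid):
    """State (2) + county (3) + tract (6): the first 11 of 15 digits."""
    return block_geoid[:11]


def county_of(block_geoid):
    """The 3-digit county FIPS code follows the 2-digit state code."""
    return block_geoid[2:5]


def merge(block_geoids, block_table, tract_income, county=None):
    """Build one record per block, then optionally keep a single county.

    block_geoids stands in for the shapefile's features; tract_income
    maps an 11-digit tract GEOID to its median household income.
    """
    records = []
    for geoid in block_geoids:
        attrs = block_table.get(geoid, {})
        records.append({
            "GEOID": geoid,
            "POP": attrs.get("POP"),
            "YOUTH": attrs.get("YOUTH"),
            # Tract-level value looked up via the block's GEOID prefix.
            "MHHINC": tract_income.get(tract_of(geoid)),
        })
    # County pruning happens last, mirroring the order described above.
    if county is not None:
        records = [r for r in records if county_of(r["GEOID"]) == county]
    return records
```

Because every lookup is a dict access, the cost is a single pass over the blocks, which is why holding the whole attribute table in memory pays off.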