One Day Build – Life Expectancy Comparison a la SQL and Python

I was inspired today to continue learning. Thanks to some folks over at Penny University (pennyuniversity.org), I found a quick little learning opportunity. I am focused this month on building some more skills in SQL and in JS. A list and map of life expectancy data was posted to our Slack channel (wikipedia.org/wiki/List_of_U.S._states_and_territories_by_life_expectancy) and showed a very interesting level of detail. County- and census-tract-level analysis had been done for the first time (https://www.pnas.org/content/117/30/17688), and that paper indicated that life expectancy varies at an even smaller, very local scale in the US. So, that was a neat read. But I noticed two of the tables in the Wikipedia article: one that shows the breakdown of these numbers by state and another by the 50 largest US cities. My guess was that the largest population centers in the US have a large effect on the numbers for the states in which they are located. This looked like an opportunity to try to do two things:

  1. Get a statistic for how similar the city data and the state data are for states where the 50 largest cities are located.
  2. Plot the same data in Python and hopefully get the same answer.

Half of this build was spent just cleaning up the five data sets I grabbed for this idea: three from Wikipedia and two from the CDC. I'll probably keep playing with this, so it was neat to try taking five .csv files and dumping them into a MySQL database. I got to remind myself how to do regex find-and-replace with back references in gedit, and that was kinda terrible, but with the datasets cleaned up sufficiently I could write a script for building the tables in a new database (no really, this took a while). I learned today about mysqlimport and its --local option, and I learned another way via 'LOAD DATA INFILE'. I also learned the use of 'FIELDS TERMINATED BY', because some of my files were tab separated and some were comma separated.
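For reference, here is a minimal sketch of the LOAD DATA route driven from Python with MySQLdb (the same library I use later for the dataframes). The database name, file name, and credentials are placeholders; the table name matches the one used in the query further down.

import MySQLdb

# Hypothetical connection details; local_infile=1 is what lets the client
# ship a local file to the server (the mysqlimport --local equivalent).
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret",
                       db="lifeexp", local_infile=1)
cur = conn.cursor()

# One of the comma-separated files; the tab-separated ones just swap the
# FIELDS TERMINATED BY character.
cur.execute("""
    LOAD DATA LOCAL INFILE 'LEByCity.csv'
    INTO TABLE LEByCity
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
""")
cur.execute("SHOW WARNINGS")   # worth checking after every load
print(cur.fetchall())
conn.commit()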

My first gotcha came when I learned about DECIMAL declarations for the numerical fields I wanted to import. A bare DECIMAL is really an integer by default (it's DECIMAL(10,0)), and that's weird to me. The default is to give you a rounded value until you tell it specifically how many digits there are in total and how many come after the decimal point.
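The fix in the table-building script (which of course runs before the load sketched above) is just to declare the precision and scale explicitly. A sketch, reusing the same cursor; the City column name and the VARCHAR sizes are guesses:

# A bare DECIMAL is DECIMAL(10,0): zero digits after the point, so 79.8 loads as 80.
# DECIMAL(4,1) keeps four digits total with one after the point, so 79.8 survives.
cur.execute("""
    CREATE TABLE LEByCity (
        City   VARCHAR(64),   -- column name and sizes are guesses
        State  VARCHAR(64),
        LE2018 DECIMAL(4,1)
    )
""")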

Then there was some more back-and-forth of deleting and reloading the data while checking the warning messages. This was pretty straightforward.

Then I had to come up with the interestingly complicated SQL call below:

select avg(d) as avgd
from (select avg(s.LE2018)-avg(c.LE2018) as d
      from LEByState1018 s join LEByCity c on s.State=c.State
      group by s.State) t;

+----------------+
| avgd           |
+----------------+
| 0.037269585161 |
+----------------+
1 row in set (0.01 sec)

I want the states that are in common between the two tables. I get that with 'from table1 alias1 join table2 alias2 on condition', and then I can take the average over the grouping by state. The trickiest part for me was realizing I needed avg(s.LE2018) even though each state only has one entry in that table: the group-by complains if a selected column isn't aggregated. Then I just have to make sure everything is named, and it works. I have my answer: 0.037 years, or roughly 13.6 days, is the amount by which the large cities differ from their states' averages, on average.

But what is the breakdown? I want to see this stuff plotted, so I will see what I can do in Python. There was a lovely post at https://plotly.com/python/v3/graph-data-from-mysql-database-in-python/. With the MySQLdb library and pandas I was able to drop my SQL tables into some dataframes with four lines of code, easily implemented with no surprises; a rough sketch of that step appears just after the figure below.

[Figure: Comparison of 2018 life expectancy data for states and the 50 largest US cities. Source: wikipedia.org/wiki/List_of_U.S._states_and_territories_by_life_expectancy, copied 10-8-2020.]
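Here is a minimal sketch of that dataframe step, plus a pandas cross-check of the SQL answer. The connection details and database name are placeholders; the table and column names are the ones from the query above.

import MySQLdb
import pandas as pd

# Placeholder credentials; the database name 'lifeexp' is hypothetical.
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="lifeexp")
states = pd.read_sql("select State, LE2018 from LEByState1018", conn)
cities = pd.read_sql("select State, LE2018 from LEByCity", conn)

# Same statistic as the SQL: per-state (state value minus mean of its cities),
# averaged over the states that appear in both tables.
merged = cities.groupby("State", as_index=False)["LE2018"].mean().rename(
    columns={"LE2018": "cityLE"}).merge(states, on="State")
print((merged["LE2018"] - merged["cityLE"]).mean())   # ~0.037 years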

 

I made some plots and could see that in many cases the city and state numbers do track each other closely. The abscissa doesn't mean a whole lot, but you can see the main point: everything stays within about a year and a half of the mean. The obvious point of note is Virginia Beach, with a life expectancy about six years lower than Virginia's as a whole. My guess is that the state figure is skewed by the DC side of Virginia, but who knows. If I decide to go deeper on this, I might try to recreate the data from the PNAS paper as a next little challenge.
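For completeness, here is a matplotlib sketch of the kind of comparison I plotted, using the cities and states dataframes from the sketch above; the styling of my actual figure differed.

import matplotlib.pyplot as plt

# Per-city difference from its state's value; the x-axis is just the row
# index, which is why the abscissa doesn't mean much.
comp = cities.merge(states, on="State", suffixes=("_city", "_state"))
diff = comp["LE2018_city"] - comp["LE2018_state"]
plt.plot(diff.values, "o")
plt.axhline(0, color="gray")
plt.ylabel("City minus state life expectancy, 2018 (years)")
plt.show()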

One Day Build – Scheduler

If I have to do anything more than twice, I'm likely to make a script to help.

Like so many others, we are keeping to ourselves these days, and it was helpful to make a scheduler pop-up to tell me when to move my child on to the next task during the day. Enter my super easy one day build.

The relevant library was notify2 in Python. It tells my OS to pop up a notification and show me a message.

import notify2
from playsound import playsound as ps
import schedule
import time

Then I set the important variables and initialize the relevant objects. These are a Notification object and a sound file name string. 

notify2.init("MyName")
n = notify2.Notification(None)
n.set_urgency(notify2.URGENCY_CRITICAL)
n.set_timeout(5000)
song_file = "gw151226.mp3" 

It plays a gravitational wave chirp. Quite nice. Then, I define one of several ‘jobs’.

def job1():
    n.update("message!")
    n.show()
    ps(song_file)

where ps is playsound.playsound from the import line. The job is then scheduled with some variant of a call to the 'schedule' library.

schedule.every().day.at("08:30").do(job1)
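For the job to actually fire, the script also needs the schedule library's standard polling loop (presumably why time is imported above); something like:

# Keep checking whether any scheduled job is due; sleep so the loop is cheap.
while True:
    schedule.run_pending()
    time.sleep(30)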

And that’s all there is. 

Be well, stay safe, wear a mask. 

N

 

On the Date of State Of The Union

As I sat at lunch today, I noticed that the State of the Union is tonight. I thought, 'Wow! That's late, isn't it?' But I wasn't sure.

So, I did the math. Here is the result:

If we look back at the dates since 1910, the current date is 1.3σ away from that mean. Restricting to more recent addresses (since 1980), which trend later and have a smaller variance, this year is still only 1.5σ from the mean. If these fluctuations were random, anything within about 3σ would be unremarkable, and the σ of 6 for this period is what you would expect for random fluctuation about some average.
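The arithmetic is just a z-score against the historical mean. A minimal sketch, with obviously made-up placeholder day-of-year values standing in for the real list of address dates:

import statistics

# Placeholder data: day-of-year of each past address (NOT the real values).
past_days = [28, 25, 30, 35, 27, 33, 24, 36, 29, 31]
this_year = 35                      # day-of-year of this year's address

mu = statistics.mean(past_days)
sigma = statistics.stdev(past_days)
z = (this_year - mu) / sigma
print(f"{z:.1f} sigma from the historical mean")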

The only interesting thing I noticed was that the earlier average date from around 1940 seemed to transition, between 1960 and 1980, to a new, later average.

Does anyone know of a reason why?

One Day Build (House Hunting)

We’re in the middle of house hunting in Nashville (which is booming).

So, I wanted a tool that would help me vet addresses as they pop up in my feed from our realtor.
We wanted to be able to walk to some things in our neighborhood, and fortunately that criterion was a pretty easy thing to map out by hand in Inkscape with screenshots from Google Maps. I was able to produce a PNG map that had those regions clearly identified. For those who remember the 2010 floods, waterways are something to be wary of, so I pulled a FEMA flood map (maps.nashville.gov) and overlaid it with a transparency by hand, easy enough. I could also have tried to overlay crime maps and offender registries, but this was sufficient for triage.

Now I just needed a converter (mapper) between x,y image coordinates and GPS coordinates.
I used PIL (the Python Imaging Library, aka Pillow) to draw on the picture I had created.
The converter looks like this:

def mapper(lat, long):
    # First attempt: Inkscape coordinates, which needed a vertical flip.
    # scale = [4528.104575164487, -162821.9551111503, 3633.747495444875, 315924.9870280141]
    # return (long*scale[2]+scale[3], flip - (lat*scale[0]+scale[1]))
    # Second attempt: Pillow pixel coordinates, (x, y) = (f(long), g(lat)).
    scale = [-5228.758169935899, 189234.5555556009, 4041.34431458903, 351362.65809204464]
    return (long*scale[2]+scale[3], lat*scale[0]+scale[1])

There are two attempts here because a conversion from GPS coordinates to Inkscape coordinates is unfortunately not the same as GPS to Pillow pixel coordinates.

I derived this via two "calibration" points on my map and their respective coordinates in Pillow.

# Calibration points: (lat, long) of two known spots and their (x, y) pixels.
given = [[36.039065, -86.782672], [36.042890, -86.606493]]
res = [[644, 795], [1356, 775]]
# Fit lat = a*y + b and long = c*x + d from the two points.
a = (given[1][0] - given[0][0]) / (res[1][1] - res[0][1])
b = -a * res[0][1] + given[0][0]
c = (given[1][1] - given[0][1]) / (res[1][0] - res[0][0])
d = -c * res[0][0] + given[0][1]
# Invert to get y = lat*scale[0]+scale[1] and x = long*scale[2]+scale[3].
scale = [1/a, -b/a, 1/c, -d/c]

This finds a slope and offset for each of the two coordinates, latitude and longitude, based on the four numbers above, and it is exact (though potentially off by a little depending on the accuracy of my calibration points). I should have picked points better, because 36.039 is not very different from 36.042. Oh well, in the end it worked.
Then I just hardcoded the values of the variable scale into the function mapper.
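A quick sanity check is to feed one of the calibration points back through mapper and confirm it lands on the pixel it came from:

# The first calibration point should map back to roughly (644, 795).
print(mapper(36.039065, -86.782672))   # -> approximately (644.0, 795.0)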

I have my x,y coordinates from latitude and longitude. Now I want to draw on my map.

from PIL import Image, ImageDraw

def drawer(coords):
    # coords is a list of (lat, long) pairs to mark on the map.
    im = Image.open("pillow.png")
    draw = ImageDraw.ImageDraw(im)
    # flip = im.size[-1]  # only needed by the earlier Inkscape version of mapper
    for pair in coords:
        # Bounding box of a small dot centered on the point.
        vec = [mapper(pair[0]+0.001, pair[1]-0.001),
               mapper(pair[0]-0.001, pair[1]+0.001)]
        print(vec)
        draw.ellipse(vec, fill=100)
    im.show()

It draws all the points in the same color, which makes it hard to tell multiple new points apart on one figure, but this was sufficient for my purposes.
The plus and minus 0.001 was found by trial and error to make correctly sized dots on the map.

The tool was just for me, but my wife also appreciated that we could use it to quickly go through the initial barrage of home listings and weed out the ones that were definitely not going to be of interest.

Not too bad for a few hours of work, and most of that was just deciding on and drawing out the regions of interest.