Asking the right questions

Have you ever had someone tell you: “Oh. That’s a good question.”

If you haven’t, you should try it. It is quite nice. But which of these do you think are good questions?

1. “What does this button do?”

If your response to number 1 is to push the button, you might be about to have a bad day. But if you are honestly asking that question of an expert on the operation of said button, it is the best question to be asking. This is the truth behind Newton’s statement: “If I have seen further it is by standing on the shoulders of giants.” Ask a subject matter expert: good question. Follow it up with questions like “What should I see when I push this button?” and “What do I do if I don’t see that?”

2. “Does this <thing I am doing> even matter?”

This value-based, pragmatic, critical-thinking question is at the heart of all good questions. It keeps you motivated if the answer is yes, and if the answer is no, it helps you spend your time more wisely. Good question. Keep asking, “Is this important to me personally or professionally?” or “Is this important to some stakeholder in the <thing I am doing>?” That stakeholder could be your wife, your employer, etc.

Cost-benefit or risk-reward analysis is also central to asking good questions. In science, statistics, and data analysis, one can usually come up with a new angle or question to ask of the data, or keep trying to squeeze out every ounce of precision available. For some data sets this matters. Let’s say there is a relatively straightforward way to be 90% confident in the insight from some data. If you want 99%, it is going to take 10-100x more effort to worry about that last little bit. A good question asks, “Is it worth it?” See my earlier post about shopping for a vacuum cleaner; it was worth it in that case.

3. “When did you stop beating your wife?”

This is a classic example of a leading question. It may serve its purpose in interrogation or parenting, but it is devastating in data analysis or decision making. If you want honest inquiry into a subject, you have to work very hard to combat biases. This is really challenging: part of a scientific mindset is to hypothesize first, to make a guess and check that it works. But that mindset very easily leads to confirmation bias if one is not careful to follow it up with the next question.

4. “What are the chances I could be wrong?”

Great question. The answer is non-zero. Failure is always an option, but glorious ‘failure’ leads to new insight, to inventions, and, eventually, to the right answer. This is the philosophical strength of Bayesian reasoning: my confidence can approach but never reach 100%. The trouble is, this mindset needs to be practiced rigorously. It is much more natural to round off likelihoods and think in terms of 0%, 50%, or 100%, with no nuance in between.

5. “What is the tallest mountain in Europe?”

This kind of academic or quiz question has its place in the classroom, where subject matter expertise needs to be assessed, though even there it doesn’t promote creativity or honest inquiry about a subject. In a professional setting, there is also the kind of question a person asks when they already know the answer and just want to show off. Worse, instead of simply making a comment, they ask in such a way as to make the speaker squirm. For my money, this is the worst kind of question.


One Day Build – Life Expectancy Comparison a la SQL and Python

I was inspired today to continue learning. Thanks to some folks over at Penny University (pennyuniversity.org), I found a quick little learning opportunity. I am focused this month on learning some more skills in SQL and in JS. A list and map of life expectancy data was posted to our Slack channel (wikipedia.org/wiki/List_of_U.S._states_and_territories_by_life_expectancy) and showed a very interesting level of detail. County- and census-tract-level analysis had been done for the first time (https://www.pnas.org/content/117/30/17688), and that paper indicated that life expectancy varies at an even smaller, very local scale in the US. So that was a neat read. But I noticed two of the tables in the Wikipedia article: one that shows the breakdown of these numbers by state and another by the 50 largest US cities. I had a guess that the largest population centers in the US have a large effect on the data of the states in which they are located. This looked like an opportunity to try to do two things:

  1. Get a statistic for how similar the city data and the state data are for states where the 50 largest cities are located.
  2. Plot the same in Python and hopefully get the same answer.

Half of this build was spent just cleaning up the five data sets I grabbed related to this idea, three from Wikipedia and two from the CDC. I’ll probably keep playing with this, so it was neat to try taking five .csv files and dumping them into a MySQL database. I got to remind myself how to regex find-and-replace with back-references in gedit, and that was kinda terrible, but with the datasets cleaned up sufficiently I could make a script for building the tables in a new database (no really, this took a while). I learned today about mysqlimport and the use of --local, and I learned another way via ‘LOAD DATA INFILE’. I learned the use of ‘FIELDS TERMINATED BY’ because some of my files were tab-separated and some were comma-separated.
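For the record, the import step looked roughly like this. This is a minimal sketch: the column layouts and file names are illustrative stand-ins, not my exact schema.

-- Illustrative tables; my real schema has more columns.
CREATE TABLE LEByState1018 (
    State  VARCHAR(64),
    LE2018 DECIMAL(4,1)
);

CREATE TABLE LEByCity (
    City   VARCHAR(64),
    State  VARCHAR(64),
    LE2018 DECIMAL(4,1)
);

-- Comma-separated file, header row skipped:
LOAD DATA LOCAL INFILE 'le_by_state.csv'
INTO TABLE LEByState1018
FIELDS TERMINATED BY ','
IGNORE 1 LINES;

-- Tab-separated file, same idea:
LOAD DATA LOCAL INFILE 'le_by_city.tsv'
INTO TABLE LEByCity
FIELDS TERMINATED BY '\t'
IGNORE 1 LINES;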

My first gotcha came when I learned about DECIMAL declarations for the numerical fields I wanted to import. DECIMAL with no arguments is effectively an integer (it defaults to DECIMAL(10,0)), and that’s weird to me. The default gives you a rounded value until you tell it specifically how many digits there are in total and how many come after the decimal point.
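To make that concrete, here is a tiny sketch (the demo table names are made up):

CREATE TABLE demo_default (le DECIMAL);       -- really DECIMAL(10,0): no fractional digits
CREATE TABLE demo_scaled  (le DECIMAL(4,1));  -- four digits total, one after the point

INSERT INTO demo_default VALUES (78.9);  -- stored as 79, with a truncation warning
INSERT INTO demo_scaled  VALUES (78.9);  -- stored as 78.9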

Then there was some more back-and-forth, deleting and reloading the data while checking the warning messages. This part was pretty straightforward.
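That loop was roughly the following, again with the illustrative file name from the sketch above:

TRUNCATE TABLE LEByState1018;             -- wipe the bad load

LOAD DATA LOCAL INFILE 'le_by_state.csv'  -- reload with the fixed declarations
INTO TABLE LEByState1018
FIELDS TERMINATED BY ','
IGNORE 1 LINES;

SHOW WARNINGS;                            -- inspect any truncation complaints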

Then I had to come up with the interestingly complicated SQL call below:

select avg(d) as avgd
from (select avg(s.LE2018) - avg(c.LE2018) as d
      from LEByState1018 s
      join LEByCity c on s.State = c.State
      group by s.State) t;

+----------------+
| avgd           |
+----------------+
| 0.037269585161 |
+----------------+
1 row in set (0.01 sec)

I want the states that are in common between the two tables. I get those with FROM table1 alias1 JOIN table2 alias2 ON condition; then I can take the average over the grouping by state. The trickiest part for me was realizing I needed avg(s.LE2018): each state has only one entry in that table, but there is a check against select-list entries not being aggregated within the group. Then I have to make sure to name everything (including the derived table, t) and it works. I have my answer: 0.037 years, or about 13.5 days, is the amount by which the large cities differ from their states’ averages, on average.
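As an aside, running just the inner query on its own lists the per-state differences, which hints at the breakdown I go after next:

SELECT s.State, AVG(s.LE2018) - AVG(c.LE2018) AS d
FROM LEByState1018 s
JOIN LEByCity c ON s.State = c.State
GROUP BY s.State
ORDER BY d;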

But what is the breakdown? I want to see this stuff plotted, so I will see what I can do in Python. There was a lovely post at https://plotly.com/python/v3/graph-data-from-mysql-database-in-python/. With the MySQLdb library and pandas, I was able to drop my SQL tables into some dataframes with four lines of code, easily implemented with no surprises.
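Those few lines looked roughly like this. The connection details are placeholders, the column names follow the sketches above, and the final check is my own sanity test against the SQL answer, not part of the original four lines:

import MySQLdb
import pandas as pd

# Placeholder credentials and database name; substitute your own.
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="lifeexp")

# Pull each table straight into a dataframe.
states = pd.read_sql("SELECT State, LE2018 FROM LEByState1018", conn)
cities = pd.read_sql("SELECT City, State, LE2018 FROM LEByCity", conn)
conn.close()

# Sanity check: line the cities up with their states and reproduce the
# SQL answer (average within each state first, then across states).
merged = cities.merge(states, on="State", suffixes=("_city", "_state"))
merged["d"] = merged["LE2018_state"] - merged["LE2018_city"]
print(merged.groupby("State")["d"].mean().mean())  # ~0.037 years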

Comparison of 2018 life expectancy data for states and the 50 largest US cities. Source: wikipedia.org/wiki/List_of_U.S._states_and_territories_by_life_expectancy, copied 10-8-2020.

I made some plots and could see that indeed, in many cases, the city tracks its state closely. The abscissa doesn’t mean a whole lot, but you can see the main point: everything trends pretty close to the mean, plus or minus about a year and a half. The obvious point of note is Virginia Beach, with a six-year lower life expectancy than Virginia as a whole. My guess is there is a skew towards DC, but who knows. If I decide to go deeper on this, I might try to recreate the data from the PNAS paper as a next little challenge.