It’s been far too long since I’ve taken a course in statistics, I forget even what to google for.
To keep it simple: I have two columns in Excel with 500k rows, each row is a response to a survey question (and of course we have dozens of questions, but I’ll focus on one).
First column (A1-A500000): Group, which can be 1, 2, or 3.
Second column (B1-B500000): On a scale of 1-10 answer to a question. (10 is happy, 1 is sad)
So the first row is, say, [1, 8]. This means an anonymous responder from Group 1 gave a score of 8 for this question.
We have found that responders in Group 3 never give a score below 7. The assumption is there is something about that group that makes them happy (or something about being happy that drives them to that group - not seeking causation). However of hundreds of thousands of responders, only a handful (20-100 - depending on the question) are in Group 3.
How can I demonstrate there is something special about Group 3 (or if that conclusion is not justified, how do I demonstrate that)?
I think this is called statistical significance.
Really my question is: What formula can I enter into Cell D1 to know whether the fact no member of Group 3 responds less than 7 means anything in understanding Group 3? (e.g. if I enter this monstrous formula, if the answer is less than x, you’ve discovered a hidden truth - or if it’s y the conclusion is meaningless)
I would be surprised (though eternally grateful) if anyone wants to actually work out the formula. Could you point me to a tutorial a layman can grok? Any nudge is helpful, and any monstrous Excel formula is simply golden.
But this is going to be super pain to do in a spreadsheet, and in fact this is why public companies like Tableau exist.
One free option to consider, which may seem more difficult now, but will actually be easier than trying to do this work in spreadsheets is to learn seaborn/python. This can and does run in a browser based jupyter notebook and has far better data visualization and manipulation possibilities. Heck it will should be able to import your spreadsheet or csv.
As this grew out of the science world it has awesome functions, like violin plots as described on this page.
Note most of the complexity in the example code they provide there is for formatting so that it fits their website’s design.
There are several tutorials, and there is one here.
This may seem hard, but wrangling excel is actually more of a challenge and by using these pre-built tools you get to use algorithms written by professional applied mathematicians, many of which work at places like LLNL etc…
They also run if the cloud if so desired, but these are just those same jupyter notebooks.
Note I don’t know too much EXCEL so maybe someone here can help with the actual Excel indicator functions.
Assuming you want to concentrate on the cutoff of 7*, then here might be an easy way to do this
Create a new column that is 1 if happiness is greater than 7, 0 otherwise. second create more three columns that are (0,1) indicators of being in group 1,2,3. Multiply the first column by each of these columns and sum to get the number of sample in each group that have happiness greater than 7. Repeat this for a less than or equal to 7 indicator column to get the number less than 7 in each group. fit these 6 values into a 2x3 table where the columns represent the groups and the rows indicate greater or less than 7.
Then follow this webpageto calculate a chi-squared statistic.
Alternatively you could also plug the values into an online calculator.
*(which I might not recommend since you choose that cutoff based on looking at the data so your results might be biased, but worst case scenario you can be very conservative and multiply your p-value by 9 to consider all possible cut offs)
Depends on your Excel version, but something like:
Sort by column A
In a blank cell, do =T.TEST(allthe1s,allthe2s,2,2). Old versions of Excel just use =TTEST without the period.
If the resulting number is less than 0.05, those two groups are significantly different.
This brings up lots of caveats:
[ul]
[li]Can only compare 2 groups at once with this, so repeat for allthe1s vs. allthe3s and allthe2s vs. allthe3s.[/li][li]Read up on correcting for multiple comparisons, and familywise error. Probably not a huge concern with only 3 comparisons.[/li][li]If the variances are not considered equal (don’t think there is a built in formula for this), change that last 2 to a 3. Try both, hopefully your answer doesn’t change.[/li][li]The first 2 in the formula is probably okay, unless you can make a case for using 1 (one-tailed test).[/li]
[/ul]
For more sophisticated analyses (e.g. Anova), you’ll need to turn on the Data Analysis toolpak
File > Options, click Add-Ins, click Go… at the bottom. Check Analysis ToolPak, click OK. You’ll only have to do this once per Excel install.
The toolpak is now in Data > Data Analysis
If you figure out how to install python and pip, it should be just:
pip install --user seaborn pandas numpy jupyter
Although --user is to avoid it installing these new packages in the system packages, just read the seaborn install docs.
You can test excel like this, which uses a Microsoft provided example Excel spread sheet.
Running the command called jupyter-notebook should pop up a browser window but refer to the docs there.
But to show you how easy this is, this is all the code that is required to download and make a chart of an Excel spreadsheet.
import pandas as pd
import seaborn as sns
sns.set(style="ticks")
df = pd.read_excel('http://go.microsoft.com/fwlink/?LinkID=521962')
sns.violinplot(x='Country', y='Profit', data=df)
That will produce the following chart, which is not that earth shattering with this data but would provide what you asked.
If your replace the url with the path to your file, and change ‘Country’ and ‘Profit’ to the names of your columns it should work for your spreadsheet.
Is this something like a Friedman test or Pearson test? Why wouldn’t a chi-squared distribution apply? You can easily compute the frequency of each score if necessary.
Re-reading the OP, maybe a Kruskal-Wallis test (assuming this is appropriate) followed by pairwise comparisons to demonstrate Group 3 is the different one?
After re-reading the OP I realized that there are multiple answers and other questions.
Scatter plots are useful, and while with two rows you won’t have enough dimensions for the matrix it is really useful for a quick view into data. https://seaborn.pydata.org/examples/scatterplot_matrix.html
If you review this document it will give an overview on why that type of matrix is useful for the discovery phase.
And while I didn’t share it before to avoid information overload, there are 100s of pre-created, tested functions if as stated above, chi square, isn’t appropriate.
To be clear, I am not trying to witness for python, but just suggesting the most popular platform, with the most easy to use tools.
If we ignore the subject of this post, as that would need to be in GD, and just think of how difficult it is for even professionals to avoid errors in Excel.
(which as fate may have it has a link back here to another thread on the dope in the comments)
Kruskal-Wallis/3 Mann-Whitney U tests would be fine. Less power than a t-test/ANOVA but accounts for assumptions. But a lot harder for the OP to implement simply than a t-test as there is no built-in Excel function. I made some spreadsheets for these tests but had to use multiple formulas.
I’m not sure if a chi-squared is appropriate or not, it’s doable considering the sample size. But it’s more complex to implement than is necessary, I’d think.