# Validity in Data Analysis

I do data analysis every week for my thesis. I generate synthetic healthcare records, analyze them, and then check whether what I generated is good or not.

For starters, the generated data needs to have dimension-wise probabilities similar to those of the original data. To calculate this, I take the mean of each column in both my generated dataset and the original dataset. This is fairly straightforward and gives me the probability of seeing a certain code in a patient. If the probabilities in the generated dataset are close to those in the original, then my data is good so far.
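A minimal sketch of this comparison, assuming each dataset is a binary matrix with one row per patient and one column per code (1 means the code is present). The function name `dimension_wise_probability` and the toy data are illustrative, not taken from my actual pipeline.

```python
def dimension_wise_probability(rows):
    """Return, for each column, the fraction of rows in which it equals 1."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]


# Toy data (hypothetical): 4 patients, 3 codes.
real = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
]
synthetic = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]

real_p = dimension_wise_probability(real)
synth_p = dimension_wise_probability(synthetic)

# Largest per-code disagreement between the two datasets.
max_gap = max(abs(r - s) for r, s in zip(real_p, synth_p))

print("real:     ", real_p)
print("synthetic:", synth_p)
print("max gap:  ", max_gap)
```

How small the gap has to be before the data counts as "similar" is a judgment call that depends on the dataset; the sketch only reports it.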

In the next step, I do a sparsity analysis. I count the empty rows (patients with no codes at all) in both datasets and compare them. This shows whether my neural network generates a few really sick patients and a lot of healthy ones, which would cancel each other out in the column averages and still pass the first check.
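The following sketch shows why this check is needed, again assuming binary patient-by-code matrices. The two toy datasets are made up so that their column averages match exactly while their share of empty rows does not. The helper names are hypothetical, and the sketch compares fractions rather than raw counts so that datasets of different sizes stay comparable.

```python
def column_means(rows):
    """Mean of each column, i.e. the dimension-wise probability."""
    n_rows = len(rows)
    return [sum(col) / n_rows for col in zip(*rows)]


def empty_row_fraction(rows):
    """Fraction of rows in which every entry is 0 (patients with no codes)."""
    return sum(1 for row in rows if not any(row)) / len(rows)


# Real data (hypothetical): every patient has exactly one code.
real = [[1, 0], [0, 1], [1, 0], [0, 1]]

# Generated data (hypothetical): two "very sick" patients, two empty ones.
synthetic = [[1, 1], [1, 1], [0, 0], [0, 0]]

# The first check cannot tell these apart...
print(column_means(real), column_means(synthetic))

# ...but the share of empty rows can.
print(empty_row_fraction(real), empty_row_fraction(synthetic))
```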

Lastly, I compute the typical metrics: AUROC, precision/recall, F1 score, and so on. These are routine and well known in my field.
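For reference, here is a small plain-Python sketch of these metrics. Which labels and scores they are computed on depends on the experiment, so the `y_true`, `scores`, and 0.5 threshold below are placeholder assumptions. In practice a library implementation would normally be used; this version just spells out the definitions.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a small sanity check. It assumes both classes occur.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Placeholder labels and classifier scores (hypothetical).
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.4, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = auroc(y_true, scores)
print(precision, recall, f1, auc)
```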

To make sure I’m doing my analysis properly, I check my code against open-source projects on the internet. Since I do not do any exceptional analysis, I have been able to look at similar projects and confirm that I’m doing it properly. Additionally, I usually start with a smaller sample and test my math on it. If it checks out, I run it on my whole dataset. If I see a huge difference between the two results, I know something is wrong with my analysis.
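The sample-versus-full comparison could look roughly like the sketch below. The dataset is synthetic filler, and the sample size, random seed, and `TOLERANCE` are arbitrary assumptions. What counts as a "huge difference" has to be decided per dataset.

```python
import random


def column_means(rows):
    """Mean of each column, i.e. the dimension-wise probability."""
    n_rows = len(rows)
    return [sum(col) / n_rows for col in zip(*rows)]


def max_abs_gap(a, b):
    """Largest absolute difference between two equally long lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Filler dataset (hypothetical): 1000 patients, 2 binary codes.
full = [[1 if i % 4 == 0 else 0, 1 if i % 2 == 0 else 0] for i in range(1000)]

# Small random sample whose result is cheap to compute and inspect by hand.
rng = random.Random(0)  # fixed seed so the check is repeatable
sample = rng.sample(full, 200)

full_p = column_means(full)
sample_p = column_means(sample)
gap = max_abs_gap(full_p, sample_p)

TOLERANCE = 0.1  # assumed threshold, not a recommended value
if gap > TOLERANCE:
    print(f"Suspicious: sample and full results differ by {gap:.3f}")
else:
    print(f"Sample and full results agree within {TOLERANCE} (gap {gap:.3f})")
```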