This EMR had 169 fields, of which 143 were retained after de-duplication. These 143 fields were then scrutinized manually, and fields that contained information encoded in other fields, or that contained information obviously irrelevant to the problem, were removed. For example, we selected ‘Number of years of tobacco chewing’ and discarded ‘Tobacco chewing (Y/N)’. We also removed some fields because of their large variation, such as ‘pin-code’ and ‘mobile number’. The final EMR had 49 high-quality fields. There were a few missing values in these records (1.308% of all the values), which were populated using either a rule or the field average. For example, the ‘sex’ field was null for 2 instances but was filled with ‘female’ because the ‘Last Menstrual Period’ field had a value for these instances. We imputed missing values for all fields using such rules, except ‘height’ and ‘weight’, for which we imputed the average value of the respective field (10 instances each).
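The two imputation strategies described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout, the field names (`sex`, `last_menstrual_period`, `height`, `weight`) and the sample values are assumptions made for illustration only.

```python
from statistics import mean

def impute(records):
    """Return copies of the records with missing values (None) filled in."""
    filled = [dict(r) for r in records]  # copy so the input is not mutated

    # Rule-based imputation: a recorded last menstrual period implies 'female'.
    for r in filled:
        if r.get("sex") is None and r.get("last_menstrual_period") is not None:
            r["sex"] = "female"

    # Average-based imputation, used only for 'height' and 'weight'.
    for field in ("height", "weight"):
        observed = [r[field] for r in filled if r.get(field) is not None]
        if not observed:
            continue  # nothing to average over; leave the gaps as they are
        avg = mean(observed)
        for r in filled:
            if r.get(field) is None:
                r[field] = avg
    return filled

# Hypothetical records, not taken from the dataset.
records = [
    {"sex": None, "last_menstrual_period": "2017-03-01", "height": 150.0, "weight": 50.0},
    {"sex": "male", "last_menstrual_period": None, "height": None, "weight": 70.0},
    {"sex": "female", "last_menstrual_period": None, "height": 160.0, "weight": None},
]
imputed = impute(records)
```

In this sketch the rule is applied before the averages are computed, so the two steps do not interfere with each other.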
Some of the fields were nominal, having multiple values, e.g. ‘marital status’ (Married, Unmarried, Widow, etc.), ‘Religion’ (Buddhist, Christian, Hindu, Muslim, others), ‘occupation’ (7 values), etc. We split each of these nominal fields into multiple binary fields, one for each value of the nominal field. For example, the ‘Marital Status’ field is split into ‘Marital Status-M’ for married, ‘Marital Status-U’ for unmarried, and so on. Thus we constructed 64 fields (including the label) from the original 49 fields, with each binary field holding the value 1 or 0, and we call these organized medical fields features.
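The split of a nominal field into binary fields can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: the helper `one_hot`, the suffix ‘W’ for widow, and the sample patients are hypothetical; only the ‘-M’ and ‘-U’ suffixes come from the text.

```python
def one_hot(records, field, codes):
    """Replace a nominal field with one binary field per possible value.

    `codes` maps a short suffix (e.g. 'M') to the nominal value it stands for.
    """
    encoded = []
    for r in records:
        # Keep every other field unchanged.
        row = {k: v for k, v in r.items() if k != field}
        # Add one 0/1 field per nominal value.
        for suffix, value in codes.items():
            row[f"{field}-{suffix}"] = 1 if r[field] == value else 0
        encoded.append(row)
    return encoded

# Suffix 'W' is an assumed name; the text only gives 'M' and 'U'.
MARITAL_CODES = {"M": "Married", "U": "Unmarried", "W": "Widow"}

# Hypothetical patients, not taken from the dataset.
patients = [
    {"age": 52, "Marital Status": "Married"},
    {"age": 47, "Marital Status": "Widow"},
]
features = one_hot(patients, "Marital Status", MARITAL_CODES)
```

Each patient ends up with exactly one of the new binary fields set to 1, provided the original value appears in `codes`.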
The features are shown in Table 1 by category, along with some examples in each category. We define our problem as a classification problem that assigns each patient a label: ‘Abnormal’ if esophageal cancer is suspected, and ‘Normal’ otherwise. The label is based on the presence of ‘Oral cavity’ as diagnosed by a doctor after consulting the ‘basic medical tests’ and Barium swallow test results, which are also part of the dataset. The positives are therefore not confirmed cancer patients but ‘suspected’ ones. Although the patients were referred for either FNAC or biopsy, the results of those tests are not available.
Discussion on bias: As mentioned earlier, although there is no self-selection bias in the dataset, some bias is induced by the selection made by the paramedical staff. This can also be observed from the mean values of the features reported in Table 1. For example, the mean age of patients in the dataset is 50.76, whereas the mean age of the general population in India is ∼28.4. Also, people having esophageal cancer are over-represented in the dataset: ∼2.5% compared to an estimated 0.06% in the overall population. Following standard practice, we have tested our methods using a dataset having similar bias. However, we feel a system developed using this biased data can be used with the general population as well, since the persons screened out by the paramedical staff are obviously low-risk patients.

4 https://en.wikipedia.org/wiki/List_of_countries_by_median_age.

Artificial Intelligence In Medicine 95 (2019) 16–26

Table 1: Statistics for the various features, category-wise.
[Table body not recoverable from the source. Surviving column headers: Feature category; Continuous (C)/ …; Training; Testing; Mean, SD for …; Mean for binary … Surviving row labels: Cancer case in …; Cancer death in …; Basic clinical test; RBSL By Glucometer; Alteration Of Voices.]
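To make the size of the prevalence bias concrete, the short calculation below uses only the two rates quoted above. The function name is a hypothetical helper, and the result is approximate because both inputs are rounded estimates.

```python
def overrepresentation(dataset_rate_pct, population_rate_pct):
    """Ratio of the prevalence in the dataset to the prevalence in the population."""
    return dataset_rate_pct / population_rate_pct

# ~2.5% suspected cases in the dataset vs. an estimated 0.06% in the population.
ratio = overrepresentation(2.5, 0.06)
print(f"Suspected cases are over-represented by a factor of about {ratio:.0f}")
```

Under these figures the positive class is roughly 40 times more frequent in the dataset than in the general population.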