Notes on codes, projects and everything

Statistical Analysis for Social Audit Project

This is the formal draft of my statistical analysis report for the social audit project previously mentioned here. As the project is public by nature, I am cross-posting it here for my own reference.

Methodology

The survey started on 10 November 2020, and the responses were collected on 23 November for analysis. In total, 110 responses were collected from the residents of USJ 16 Sri Tanjung Flats. The responses first went through a series of preprocessing and visualization steps in a Jupyter notebook <https://jupyter.org/>, a tool for visualizing and manipulating data. The data was processed by code written in Python <https://www.python.org/>, a programming language. After a round of preprocessing (correcting spelling errors and aggregating responses with the same meaning), the responses were counted and tabulated into pandas dataframes <https://pandas.pydata.org/> for further processing.
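The cleaning and counting step can be pictured roughly as follows. This is a minimal sketch using only the Python standard library, not the code from the notebook; the `CANONICAL` mapping and the sample answers are made up for illustration.

```python
from collections import Counter

# Hypothetical spelling variants mapped to one canonical answer.
# The real mapping used for the survey is not reproduced here.
CANONICAL = {
    "smart phone": "smartphone",
    "handphone": "smartphone",
    "notebook": "laptop",
}


def normalise(answer: str) -> str:
    """Lower-case, collapse whitespace, and map known variants to one label."""
    cleaned = " ".join(answer.strip().lower().split())
    return CANONICAL.get(cleaned, cleaned)


def tabulate(answers):
    """Count how often each normalised answer occurs."""
    return Counter(normalise(answer) for answer in answers)


# Made-up raw answers with inconsistent spelling and spacing.
raw = ["Smartphone", " smart phone ", "Laptop", "notebook", "HANDPHONE"]
counts = tabulate(raw)
print(counts)
```

Counts produced this way are what get arranged into the tables shown in the analysis result section.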

In this study, we are interested in finding out whether a person’s age, gender, employment status, household income and education level affect their access to the Internet. Note that the gender field was added to the survey at a later stage, so 28 entries were excluded from the analysis of that factor. The questions chosen for the study of Internet access are as follows (in no particular order):

  1. How does a respondent access government service?
  2. What internet-enabled devices does a respondent own and use?
  3. Whether a respondent gets to access the internet as much as desired?
  4. Has a respondent used public Wi-Fi service?
  5. Has a respondent used public Wi-Fi service provided by the government (e.g. Wi-Fi Selangorku, Wi-Fi Smart Selangor by the Selangor state government)?
  6. Whether a respondent is able to and can afford to work from home?
  7. Whether a respondent can comfortably work from home?
  8. Can family members of a respondent learn online, if applicable?
  9. Are the family members of a respondent able to learn comfortably online (if applicable)?
  10. How much would a respondent pay a month for internet access service?
  11. How much would a respondent pay for a computer or laptop (notebook computer)?
  12. How much would a respondent pay for a smartphone or tablet?

Selected answers were tested for independence against the previously mentioned factors with a chi-square test at alpha = 0.05. The null hypothesis was that the two variables (e.g. gender and how a respondent accesses government services) are independent; the alternative hypothesis was that the two variables are not independent, i.e. they are associated.
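To make the test concrete, here is a minimal sketch of the chi-square statistic written with the standard library only (the notebook itself may well rely on a statistics library instead). The counts are the education level vs government Wi-Fi figures reported later in this post, and the critical value is the one quoted there rather than computed here. The function assumes every row and column has at least one response, otherwise an expected count of zero would cause a division by zero.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for a contingency table
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Columns: No, Yes, Not sure / I don't know
observed = [
    [36, 9, 2],   # Secondary school
    [33, 25, 3],  # Higher education
]
statistic, dof = chi_square(observed)

# Critical value for alpha = 0.05 as quoted in this report (2 degrees of freedom).
CRITICAL_VALUE = 5.9915
reject_null = statistic > CRITICAL_VALUE
print(round(statistic, 4), dof, reject_null)
```

The null hypothesis of independence is rejected only when the statistic exceeds the critical value for the table's degrees of freedom.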

If the null hypothesis is rejected for a given pair of variables, we proceed with another test to see how a change in one variable affects the other. Correlation analysis was done for pairs involving age, household income and education level, as the responses for these can be ranked. The correlation for each pair of variables was calculated as Spearman’s rank correlation coefficient, because age and household income (both of which we aggregated into bins) and education level are ordered categories rather than numeric values.
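As an illustration of the coefficient, the sketch below computes Spearman’s rank correlation as the Pearson correlation of the ranks, with tied values sharing the mean of their ranks. It is written from scratch for this post and is not the notebook code; the example data are the six age bins (in order) against the number of respondents in each bin who answered no to the government Wi-Fi question, as reported later.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(xs), average_ranks(ys))


age_bin_order = [1, 2, 3, 4, 5, 6]     # youngest to oldest age bin
answered_no = [0, 2, 14, 34, 16, 4]    # count per age bin
rho = spearman(age_bin_order, answered_no)
print(round(rho, 4))
```

A positive coefficient means the count tends to rise with the ranked variable, and a negative one means it tends to fall.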

After obtaining the coefficient, which determines the degree and type of correlation (e.g. weak positive correlation, very high negative correlation), we performed a hypothesis test to see whether the correlation is significant. A Student’s t-test at alpha = 0.05 was used, as the number of observed distinct pairs of the two variables was fewer than 30. For this significance test, the null hypothesis was that there is no correlation between the variables in the population, and the alternative hypothesis was that the correlation is significant.
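The test statistic for this step is commonly computed as t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The sketch below assumes that formula; the value n = 5 is not stated in this report but is inferred from the critical value ±3.1824 quoted later, which corresponds to 3 degrees of freedom. The function names are illustrative only.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing whether a correlation coefficient r,
    computed from n pairs, differs from zero (n - 2 degrees of freedom)."""
    if n < 3:
        raise ValueError("at least 3 pairs are needed")
    denominator = 1.0 - r * r
    if denominator <= 0.0:
        # A perfect correlation makes the statistic unbounded.
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / denominator)


def is_significant(t_statistic, critical_value):
    """Two-tailed decision: significant only beyond the critical values."""
    return abs(t_statistic) > critical_value


# Household income vs the lowest budget bin, as reported later in this post.
# n = 5 is an assumption inferred from the quoted critical value.
t_stat = correlation_t_statistic(-0.7856, 5)
print(round(t_stat, 4), is_significant(t_stat, 3.1824))
```

When r is only approximately ±1 in floating point, the denominator is close to zero and the statistic becomes enormous. The values of 134217728 and 67108864 reported later in this post are most likely that effect rather than a meaningful measure of strength.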

However, the responses for gender and employment status cannot be ranked, so correlation analysis could not be performed for them. A test for the difference in means between groups of responses was planned instead, but we were unable to perform it because none of the pairs of variables involving these two factors were associated, possibly due to insufficient responses or a flaw in the questionnaire.

The pairs of variables that are associated, together with their correlation results, are shown in the following section.

Analysis result

The budget for Internet access is binned into the following groups (in Malaysian Ringgit):

  1. Greater than -0.001 and less than or equal to 50.0
  2. Greater than 50.0 and less than or equal to 150.0
  3. Greater than 200.0 and less than or equal to 250.0
  4. Greater than 250.0 and less than or equal to 300.0
  5. Greater than 350.0 and less than or equal to 400.0
  6. Greater than 400.0 and less than or equal to 450.0
  7. Greater than 450.0 and less than or equal to 500.0
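Bins of this kind are open on the left and closed on the right. The sketch below shows one way to assign a value to such a bin with the standard library. The edges (0 to 500 in steps of 50) are an assumption based on the bin labels appearing in this report, and `bin_label` is a hypothetical helper, not the notebook code. The lowest bin also accepts 0, which is presumably why its lower bound is printed as -0.001 in this report.

```python
from bisect import bisect_left

# Assumed bin edges in Malaysian Ringgit: 0, 50, 100, ..., 500.
EDGES = list(range(0, 501, 50))


def bin_label(value, edges=EDGES):
    """Return the right-closed bin "(low, high]" that contains value.
    The lowest edge itself is included in the first bin."""
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} is outside the binned range")
    # bisect_left finds the first edge >= value, so a value equal to an
    # edge falls into the bin that closes on that edge.
    upper = max(bisect_left(edges, value), 1)
    return f"({edges[upper - 1]}, {edges[upper]}]"


for budget in (0, 50, 50.01, 120, 500):
    print(budget, bin_label(budget))
```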

When we compare household income with the budget for Internet access (refer to the table and graph below), we find that the two variables are associated: the test statistic is 64.9907, which is greater than the critical value of 41.3371. As mentioned, the alpha value chosen for the independence test was 0.05, which means that if the two variables were in fact independent, we would wrongly conclude that they are associated at most 5% of the time. The following table and graph show the observed counts for the different pairs of values of the two variables (rows and columns containing only zeros are removed from the table).

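Observed counts: household income (rows) by budget internet (columns)

Household income | (-0.001, 50.0] | (50.0, 100.0] | (100.0, 150.0] | (150.0, 200.0] | (450.0, 500.0]
(-8.001, 1000.0] | 5 | 2 | 0 | 0 | 0
(1000.0, 2000.0] | 21 | 5 | 0 | 0 | 0
(2000.0, 3000.0] | 20 | 10 | 2 | 0 | 0
(3000.0, 4000.0] | 13 | 10 | 0 | 1 | 1
(4000.0, 5000.0] | 5 | 2 | 0 | 1 | 0
(5000.0, 6000.0] | 0 | 3 | 0 | 1 | 0
(6000.0, 7000.0] | 0 | 0 | 1 | 2 | 0
(7000.0, 8000.0] | 0 | 2 | 0 | 0 | 0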
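Removing the all-zero rows and columns matters for the test as well as for display, because it changes the degrees of freedom and avoids expected counts of zero. The sketch below rebuilds the table above with the empty budget bins put back, under the assumption that the budget was cut into ten bins of 50 from 0 to 500, and then drops them again. `drop_empty` is a hypothetical helper written for this post.

```python
def drop_empty(table):
    """Remove rows and columns whose counts are all zero."""
    rows = [row for row in table if any(row)]
    if not rows:
        return []
    keep = [j for j in range(len(rows[0])) if any(row[j] for row in rows)]
    return [[row[j] for j in keep] for row in rows]


# Household income bins (rows) by ten assumed budget bins (columns).
# Only columns 0-3 and 9 have any responses.
full_table = [
    [5, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [21, 5, 0, 0, 0, 0, 0, 0, 0, 0],
    [20, 10, 2, 0, 0, 0, 0, 0, 0, 0],
    [13, 10, 0, 1, 0, 0, 0, 0, 0, 1],
    [5, 2, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 2, 0, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
]

trimmed = drop_empty(full_table)
# Degrees of freedom for the chi-square test: (rows - 1) * (columns - 1).
dof = (len(trimmed) - 1) * (len(trimmed[0]) - 1)
print(len(trimmed), len(trimmed[0]), dof)
```

With 8 rows and 5 columns left, the test has 28 degrees of freedom, which is consistent with the critical value of 41.3371 quoted for this pair of variables.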

For the budget in the (-0.001, 50.0] bin, there is a high negative correlation with household income at -0.7856, which means an increase in household income yields a lower count in this budget bracket. However, the correlation is not significant, as the test statistic of -2.1995 lies between the critical values ±3.1824. Therefore, there is no significant correlation between household income and the monthly budget for Internet access in the population (i.e. all residents of USJ 16 Sri Tanjung Flats).

There is some negative correlation at -0.4542 for the budget in the (50.0, 100.0] bin. However, there is also no significant correlation in the population, as the test statistic of -0.8830 lies between the critical values ±3.1824.

A very low correlation of 0.0623 is found for the budget in the (100.0, 150.0] bin, which means an increase in household income yields a slightly higher count in this budget bracket. However, the correlation is also not significant at the same alpha value, as the test statistic of 0.1082 lies between the critical values ±3.1824.

There is some correlation for the budget in the (150.0, 200.0] bin, at 0.4956. However, we again fail to reject the null hypothesis, as the test statistic of 0.9882 lies between the critical values ±3.1824.

Lastly, there is a very low negative correlation of -0.0825 for the budget in the (450.0, 500.0] bin. As in the other cases above, the correlation is not significant, with a test statistic of -0.1433 against the critical values ±3.1824.

There were 7 different types of devices observed in the survey responses that are owned and used by the respondents, namely:

  1. Desktop computer
  2. Laptop (notebook computer including netbooks)
  3. Tablet computer (e.g. Apple iPad, Samsung Galaxy Tab etc.)
  4. Gaming console (e.g. Sony PlayStation, Nintendo Switch, Microsoft Xbox)
  5. Smartphone
  6. Smart TV
  7. Android TV Box (Set-top boxes running Android as the operating system)

The null hypothesis is rejected for education level and types of devices owned and used by respondents where the test statistic is 15.7739, which exceeds the critical value 12.5915.

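Observed counts: education level (rows) by device owned and used (columns)

Education level | Desktop computer | Laptop | Tablet computer | Gaming console | Smartphone | Smart TV | Android TV box
Secondary school | 4 | 9 | 5 | 4 | 46 | 1 | 1
Higher education | 15 | 37 | 17 | 4 | 60 | 0 | 0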
There is a perfect correlation of 1.0000 between education level and ownership and use of a desktop computer. This means that when the education level increases, the observed count of ownership and use of a desktop computer is higher. The test for significance shows the correlation is significant, as the test statistic of 134217728.0000 lies beyond the critical values ±2.7764.

Similarly, another perfect correlation of 1.0000 is found between education level and ownership and use of a laptop. The correlation is also significant, as the test statistic of 134217728.0000 lies beyond the critical values ±2.7764.

A perfect correlation of 1.0000 is also found when the device is a tablet. As in the previous cases, the correlation is significant, as the test statistic of 134217728.0000 lies beyond the critical values ±2.7764.

Another perfect correlation of 1.0000 is found for a smartphone. The test of significance also rejects the null hypothesis, as the test statistic of 134217728.0000 lies beyond the critical values ±2.7764.

Meanwhile, for a smart TV, there is a perfect negative correlation of -1.0000, which means that when the education level increases, the observed count of smart TV ownership decreases. The correlation is significant here, as the test statistic of -134217728.0000 lies beyond the critical values ±2.7764. However, as this is one of the options that a respondent added to the survey, chances are other respondents were not aware that they could do so when they filled in the questionnaire.

Lastly, there is a perfect negative correlation of -1.0000 for an Android TV box. The correlation is also significant, as the test statistic of -134217728.0000 lies beyond the critical values ±2.7764. As in the case of the smart TV, other respondents might not have been aware that they could add an option to their answer, hence the low count and the perfect correlation score.

We also studied whether a respondent has used the public Wi-Fi service provided by the government. For this question, a respondent answers yes, no, or not sure / I don’t know.

The test of independence shows that prior experience with this government service depends on the education level of a respondent: the test statistic is 6.1483, which is higher than the critical value 5.9915.

Observed counts: education_level (rows) by used_selangor_Wi-Fi (columns)

education_level | No | Yes | Not sure / I don’t know
Secondary school | 36 | 9 | 2
Higher education | 33 | 25 | 3

There is a perfect negative correlation of -1.0000 for cases where respondents had no experience using Wi-Fi services provided by the government. The negative correlation means an increase in education level decreases the count of responses answering no. The test statistic for significance is -67108864.0000, which lies beyond the critical values ±12.7062, so the correlation is significant.

For cases where the respondent has used Wi-Fi services provided by the government, the correlation is perfect, with a coefficient of 1.0000. Since the coefficient is positive, an increase in education level increases the count of responses answering yes. The correlation is also significant, as the test statistic of 67108864.0000 lies beyond the critical values ±12.7062.

Lastly, the correlation is also perfect at 1.0000 for cases where a respondent is unsure or does not know about the service. The correlation is significant, as the test statistic of 67108864.0000 also lies beyond the critical values ±12.7062.

When we study age groups and prior experience with the government Wi-Fi service, we also find that the two variables are associated: the test statistic is 22.8679, which is greater than the critical value 18.3070.

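Observed counts: age (rows) by use of public Wi-Fi provided by the government (columns)

Age | No | Yes | Not sure / I don’t know
(-0.001, 10.0] | 0 | 0 | 1
(10.0, 20.0] | 2 | 3 | 0
(20.0, 30.0] | 14 | 8 | 2
(30.0, 40.0] | 34 | 17 | 2
(40.0, 50.0] | 16 | 6 | 1
(50.0, 60.0] | 4 | 0 | 0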

The correlation is high at 0.6 for the case where respondents answered no, which means an increase in age yields a higher count answering no. However, the correlation is not significant, as the test statistic of 0.75 lies between the critical values ±12.7062.

For the case where respondents answered yes, the correlation is very low at 0.1160. The correlation is also not significant, as the test statistic of 0.1167 lies between the critical values ±12.7062.

Lastly, there is a negative correlation for respondents who answered not sure / I don’t know, with a coefficient of -0.1195. This means fewer older respondents chose this answer. The correlation is also not significant, as the test statistic of -0.1204 lies between the critical values ±12.7062.

The question was also generalized to find out whether a respondent had prior experience with any public Wi-Fi service. An association is found when the result is compared with the age groups of respondents: the test statistic is 31.1042, which is greater than the critical value 18.3070.

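Observed counts: age (rows) by use of public Wi-Fi (columns)

Age | No | Not sure / I don’t know | Yes
(-0.001, 10.0] | 0 | 1 | 0
(10.0, 20.0] | 2 | 0 | 3
(20.0, 30.0] | 12 | 0 | 12
(30.0, 40.0] | 23 | 2 | 28
(40.0, 50.0] | 14 | 1 | 8
(50.0, 60.0] | 3 | 0 | 1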

A high correlation of 0.6 is found where the respondents answered no, which means an increase in age yields a higher count for this answer. However, the correlation is not significant, as the test statistic of 0.75 lies between the critical values ±12.7062.

There is a very low negative correlation of -0.0308 for the answer not sure / I don’t know. The negative correlation here indicates that an increase in age yields a lower count for this answer. However, the correlation is not significant, as the test statistic of -0.0308 lies between the critical values ±12.7062.

Meanwhile, the correlation is low at 0.2571 for yes as the response to this question. The correlation is also not significant, as the test statistic of 0.2660 lies between the critical values ±12.7062.

The ability to work from home is another question included in this study. The test of independence rejects the null hypothesis when we compare age groups with the ability to work from home: the test statistic is 22.4093, which is greater than the critical value 18.3070.

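Observed counts: age (rows) by ability to work from home (columns)

Age | No | Yes | Not sure / I don’t know
(-0.001, 10.0] | 1 | 0 | 0
(10.0, 20.0] | 0 | 2 | 3
(20.0, 30.0] | 0 | 21 | 3
(30.0, 40.0] | 11 | 34 | 8
(40.0, 50.0] | 4 | 14 | 5
(50.0, 60.0] | 2 | 2 | 0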

There is some correlation at 0.5798 for the answer no, which means an increase in age yields a higher count for this answer. However, the test of significance finds that this correlation is not significant, as the test statistic of 0.7116 lies between the critical values ±12.7062.

A low correlation is found for the answer yes, with a coefficient of 0.3769. The correlation, however, is not significant, as the test statistic of 0.4068 lies between the critical values ±12.7062.

Lastly, the correlation is also low for the answer not sure / I don’t know, at 0.2060. The correlation is also not significant, as the test statistic of 0.2105 lies between the critical values ±12.7062.

Future enhancements

As most of the independence tests failed to reject the null hypothesis, it likely indicates that:

  1. There are not enough responses
  2. There are flaws in the design of the questionnaire, for example only 1 response for smart TV and Android TV box ownership

If this study is repeated in the future, more needs to be done to encourage responses. Having more responses should also help with the test for significance of correlation.

Some respondents were likely unaware of the “other…” option in the survey, or did not think their answer was relevant. The question could be more descriptive so that respondents know what can be filled in.

Conclusion

Out of the 12 questions we analyzed with respect to the 5 characteristics of respondents, the study found association in the following pairs of variables:

  1. Household income vs monthly budget for Internet access service.
  2. Education level vs ownership and use of different types of Internet-enabled devices
  3. Education level vs prior experience with public Wi-Fi by state government
  4. Age vs prior experience with public Wi-Fi by the government
  5. Age vs prior experience with public Wi-Fi
  6. Age vs ability to work from home

Besides that, the study also found correlations of varying degrees within these pairs of variables.

The Jupyter notebook of this analysis is published and can be found at <https://github.com/Jeffrey04/social-audit-2020/blob/master/report.ipynb>.
