27593 Methods for Mining and Analyzing Web-Log Data by Example: Data From a Web-Based Study for Newly Diagnosed Prostate Cancer Patients

Linda Fleisher, PhD, MPH, Office of Health Communications and Health Disparities, Fox Chase Cancer Center, Cheltenham, PA and Venk Kandadai, MPH, Office of Health Communications and Health Disparities, Fox Chase Cancer Center, Cheltenham

Theoretical Background and research questions/hypothesis: Given the extensive use of the Internet for health information, web-based health promotion interventions are widely perceived as an effective communication channel.  Most studies of web-based interventions report on the efficacy of the interventions themselves, and very few report on the methodological techniques used to study how the interventions are actually used.  This study reports on the methodological techniques used to mine and analyze web-log data from a web tracking study, with the aim of understanding patterns of use of a prostate cancer treatment decision aid and developing variables for hypothesis testing.

Methods:  The web-based decision aid had tracking capability, and we obtained web-log data from 56 male participants.  First, the web logs were cleaned to remove extraneous or irrelevant data.  Although web logs produce volumes of readily quantifiable data, there is an element of subjectivity in selecting and refining the specific data to be analyzed.  Next, operational definitions were developed to standardize specific usage variables; decision rules such as the minimum time required to constitute “a session” are subjective choices.  Finally, the investigator created additional variables such as type of medium (text vs. video) and content type (radiation vs. surgery).  Web-log data mining was conducted using the SQL (Structured Query Language) procedure (PROC SQL) in SAS v 9.2.  Given the large amount of web-log output, it was essential to be able to query the data for anything extraneous to the analysis.  For example, when creating a variable for actual usage in units of time, time spent in idle mode and page loading had to be excluded.  Locating these periods of extraneous or meaningful time in each participant's data required a multi-step analytical process.  In addition to defining the components and key aspects of the content, definitions of usage were created.  For example, the website had to be accessed for at least one minute to be counted as a session; the same definition was used for both initial and subsequent sessions.
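The querying steps described above can be illustrated with a minimal sketch. It is not the study's PROC SQL code: it uses Python's built-in sqlite3 module, and the table layout (`page_views`), the column names, the sample rows, and the 30-minute idle cutoff are all assumptions made for illustration. Only the one-minute minimum session length comes from the abstract. The sketch also assumes that session length is measured after idle page views are dropped, which the abstract does not specify.

```python
import sqlite3

# Hypothetical cleaned log: one row per page view (all values are invented).
# Columns: participant_id, session_id, medium, content, seconds on page.
rows = [
    (1, 1, "text", "surgery", 120),
    (1, 1, "video", "radiation", 300),
    (1, 1, "text", "radiation", 4000),  # long gap, treated here as idle time
    (1, 2, "text", "surgery", 20),      # session shorter than one minute
    (2, 1, "video", "surgery", 90),
]

IDLE_CUTOFF = 1800  # seconds; an assumed threshold, not taken from the study
MIN_SESSION = 60    # seconds; the one-minute session rule from the abstract


def usage_by_medium(page_views):
    """Return active seconds per participant and medium for valid sessions."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE page_views (participant_id INTEGER, session_id INTEGER, "
        "medium TEXT, content TEXT, seconds INTEGER)"
    )
    con.executemany("INSERT INTO page_views VALUES (?, ?, ?, ?, ?)", page_views)
    query = """
        WITH active AS (
            -- drop page views long enough to be considered idle
            SELECT * FROM page_views WHERE seconds <= :idle
        ),
        valid_sessions AS (
            -- keep sessions with at least the minimum active time
            SELECT participant_id, session_id
            FROM active
            GROUP BY participant_id, session_id
            HAVING SUM(seconds) >= :min_session
        )
        SELECT a.participant_id, a.medium, SUM(a.seconds) AS active_seconds
        FROM active AS a
        JOIN valid_sessions AS v
          ON a.participant_id = v.participant_id
         AND a.session_id = v.session_id
        GROUP BY a.participant_id, a.medium
        ORDER BY a.participant_id, a.medium
    """
    result = con.execute(
        query, {"idle": IDLE_CUTOFF, "min_session": MIN_SESSION}
    ).fetchall()
    con.close()
    return result


print(usage_by_medium(rows))
```

Each returned row gives a participant, a medium, and the active seconds spent on that medium, which is the kind of derived usage variable the Methods describe. Grouping by `content` instead of `medium` would give the radiation vs. surgery breakdown in the same way.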

Results: We were able to quantify the level of “access” of a complicated, multimedia website and to develop domains of use, such as the degree of access to text-based and video-based web content.  These domains were ultimately used as variables for hypothesis testing.  We were also able to separate idle time from actual usage time and better understand the differences between them.  The SQL techniques used were efficient for mining web data and developing analytical datasets.

Conclusions: The methods used in this analysis were effective for mining and developing variables from web-log data. 

Implications for research and/or practice:  Because little is known about how to analyze web-log data, it is important to develop novel techniques to better understand and quantify web use for scientific inquiry and, ultimately, to tailor these types of interventions for specific populations.