In school, we work mainly with tiny datasets we can type into our technology if all else fails. In practice, we generally start with raw data in a computer file. Often, getting the data into a form our technology can work with is a major undertaking. We will go through that once right now.
R can read data from a text file. The file must be laid out as a table whose columns represent variables. All columns must be the same length, and missing data must be marked as "NA". Optionally, the first row of the file may contain names for the variables. You can access such a file named heartatk4R.txt. Download and save this file in R's working directory, the folder where R looks for files by default. The file looks like this.
Patient  DIAGNOSIS  SEX  DRG  DIED  CHARGES    LOS   AGE
1        41041      F    122  0     4752       0010  079
2        41041      F    122  0     3941       0006  034
3        41091      F    122  0     3657       0005  076
4        41081      F    122  0     1481       0002  080
5        41091      M    122  0     1681       0001  055
6        41091      M    121  0     6378.6400  0009  084
7        41091      F    121  0     10958.520  0015  084
8        41091      F    121  0     16583.930  0015  070
9        41041      M    121  0     4015.3300  0002  076
10       41041      F    123  1     1989.4400  0001  065
11       41041      F    121  0     7471.6300  0006  052
12       41091      M    121  0     3930.6300  0005  072
13       41091      F    122  0     NA         0009  083
14       41091      F    122  0     4433.9300  0004  061
15       41041      M    122  0     3318.2100  0002  053
16       41041      M    122  0     4863.8300  0005  077
17       41041      M    121  0     5000.6400  0003  053
Above are only the first 17 cases (of 12,844). To use this in R you must define a variable to be equal to the contents of this file.
> heartatk = read.table("heartatk4R.txt",header=TRUE)
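If R reports that it cannot open the file, it is probably looking in a different folder from the one you saved to. The lines below are a minimal sketch for checking this. The folder name "path/to/folder" is only a placeholder, not a real location.

```r
# Folder where R looks for files that are named without a path
getwd()

# TRUE if heartatk4R.txt is in that folder
file.exists("heartatk4R.txt")

# If the file is somewhere else, give read.table the full path.
# "path/to/folder" is a placeholder; replace it with your own folder.
data_file = file.path("path/to/folder", "heartatk4R.txt")
if (file.exists(data_file)) {
  heartatk = read.table(data_file, header = TRUE)
}
```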
The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. (These should not include spaces.) You can now get a table of contents for what you have created in R with
> objects()
This should return heartatk along with any other variables you may have created. You will not see on this list any of the variables that are inside of heartatk because they are hiding. To see them, type
> names(heartatk)
[1] "Patient"   "DIAGNOSIS" "SEX"       "DRG"       "DIED"      "CHARGES"
[7] "LOS"       "AGE"
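Beyond the names, it can help to check how R interpreted each column. The sketch below is one way to do that. So that it runs without the full data file, it writes a few rows copied from the listing above to a temporary file and reads them back; the name small is made up for this illustration.

```r
# A few rows copied from the listing above, written to a temporary
# file so the example does not need heartatk4R.txt
sample_lines = c(
  "Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE",
  "1 41041 F 122 0 4752 0010 079",
  "2 41041 F 122 0 3941 0006 034",
  "6 41091 M 121 0 6378.6400 0009 084",
  "10 41041 F 123 1 1989.4400 0001 065"
)
tmp = tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

small = read.table(tmp, header = TRUE)

dim(small)       # number of rows, then number of columns
str(small)       # how R stored each column
head(small, 3)   # first three rows
```

Notice that entries such as 0010 and 079 are read as the ordinary numbers 10 and 79. With the real file, the same three commands can be applied to heartatk.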
To bring the variables inside heartatk out of hiding, you must "attach" the table to your R workspace. (Keeping them hidden until then avoids conflicts if several tables include variables with the same name. Attach just one table at a time.)
> attach(heartatk)
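Attaching is not the only way to reach the variables. As a rough sketch of two alternatives, you can name the table explicitly with $, or evaluate an expression inside the table with with(). The table small below is a made-up stand-in built from four rows of the listing above, so the example runs on its own.

```r
# Small stand-in for heartatk (patients 1, 2, 6 and 10 from the listing)
small = data.frame(
  SEX     = c("F", "F", "M", "F"),
  DIED    = c(0, 0, 0, 1),
  CHARGES = c(4752, 3941, 6378.64, 1989.44),
  AGE     = c(79, 34, 84, 65)
)

# Name the table explicitly with $
mean(small$AGE)

# Or evaluate an expression inside the table with with()
with(small, mean(CHARGES))

# If you do attach a table, detach it when you are finished
attach(small)
table(SEX)
detach(small)
```

Either alternative leaves no doubt about which table a variable comes from, which matters once you have more than one table loaded.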
These data came from an ActivStats CD which provided this background information:
Heart Attack Patients

This set of data is all of the hospital discharges in New York State with an admitting diagnosis of an Acute Myocardial Infarction (AMI), also called a heart attack, who did not have surgery, in the year 1993. There are 12,844 cases.

AGE gives age in years
SEX is coded M for males F for females
DIAGNOSIS is in the form of an International Classification of Diseases, 9th Edition, Clinical Modification code. These tell which part of the heart was affected.
DRG is the Diagnosis Related Group. It groups together patients with similar management. In this data set there are just three different drgs.
    121 for AMIs with cardiovascular complications who did not die.
    122 for AMIs without cardiovascular complications who did not die.
    123 for AMIs where the patient died.
LOS gives the hospital length of stay in days.
DIED has a 1 for patients who died in hospital and a 0 otherwise.
CHARGES gives the total hospital charges in dollars.

Data provided by Health Process Management of Doylestown, PA.
After you attach the data table, you can work with the internal variables, provided you remember that R is case-sensitive.
> table(sex)
Error in table(sex) : object "sex" not found
> table(SEX)
SEX
   F    M
5065 7779
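If you are unsure how a name is capitalized, you can ask R rather than guess. The sketch below uses a made-up four-row table called small in place of heartatk. The lower-case copy at the end is only one possible convenience, not something the data require.

```r
# Made-up stand-in for heartatk
small = data.frame(SEX = c("F", "F", "M", "F"), AGE = c(79, 34, 84, 65))

# Exact spelling, including case, of every variable in the table
names(small)

# Is there a variable called "sex"? Is there one called "SEX"?
"sex" %in% names(small)
"SEX" %in% names(small)

# Optional: work on a copy whose names are all lower case
small_lower = small
names(small_lower) = tolower(names(small_lower))
table(small_lower$sex)
```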
© 2006 Robert W. Hayden. Data Desk is a registered trademark of Data Description.