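University of Science and Technology Beijing
STATA Application Study Notes

Chapter 1: Basic STATA Operations

I. Setting the memory size
    set mem 500m, perm

II. Displaying input
    display 1
    display "clive"

III. Showing the dataset structure: describe
    describe
   (abbreviation: d)

IV. Editing data: edit
    edit

V. Renaming a variable
    rename var1 var2

VI. Showing the dataset contents: list / browse
    list in 1
    list in 2/10

VII. Importing data from a text file (.csv)
1. insheet:
    insheet using "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.csv", clear
2. A dataset can be imported only when memory is empty; otherwise STATA reports "you must start with an empty dataset". There are two ways to handle this:
   (1) Clear all variables from memory:
    drop _all
   (2) Add the clear option to the import command.

VIII. Saving a file
1. Save to a new file:
    save "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.dta"
2. Save and overwrite an existing file:
    save "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.dta", replace

IX. Opening and closing a saved file: use
1. Open, giving the file path and file name:
    use path-and-filename, clear
2. Close the data or quit:
    drop _all
    exit

X. Recording commands and output (log)
1. Start a log file:
    log using "J:\phd\output.log", replace
2. Pause logging:
    log off
3. Resume logging:
    log on
4. Close the log file:
    log close

XI. Creating and saving program files (doedit, do)
1. Open the do-file editor:
    doedit
2. Type in the commands.
3. Save the file with the .do extension.
4. Run the commands, giving the program file path and file name:
    do path-and-filename

XII. Combining several datasets into one (same variables and structure): vertical merge with append
    insheet using "J:\phd\Fees1.csv", clear
    save "J:\phd\Fees1.dta", replace
    insheet using "J:\phd\Fees2.csv", clear
    append using "J:\phd\Fees1.dta"
    save "J:\phd\Fees1.dta", replace

XIII. Horizontal merge, adding further variables to the original dataset: merge
1. Sort both datasets by the key variables, then merge:
    insheet using "J:\phd\Fees1.csv", clear
    sort companyid yearend
    save "J:\phd\Fees1.dta", replace
    describe
    insheet using "J:\phd\Fees6.csv", clear
    sort companyid yearend
    merge companyid yearend using "J:\phd\Fees1.dta"
    save "J:\phd\Fees1.dta", replace
    describe
2. Values of _merge:
    _merge==1   obs. from master data
    _merge==2   obs. from using data
    _merge==3   obs. from both master and using data

XIV. Help files: help
1. For example:
    help describe

XV. Descriptive statistics
1. summarize
   A single variable:
    summarize incorporationyear
   A consecutive range of variables:
    summarize incorporationyear-big6
   All variables (or simply summarize):
    summarize _all
2. More detailed statistics:
    summarize incorporationyear, detail
3. centile
    centile auditfees, centile(0(10)100)
    centile auditfees, centile(0(5)100)
4. tabulate: frequencies and proportions by category
    tabulate companytype
   Column percentages:
    tabulate companytype big6, column
   Row percentages:
    tabulate companytype big6, row
    tab companytype big6 if companytype=0 & lnaf=0 & lnaf=0 & lnafmax & cook0

[...]

4. When the data are censored on both the left and the right, a tobit model can also be used:
    gen lnnaf1=lnnaf
    replace lnnaf1=5 if lnnaf>5 & lnnaf!=.
    tobit lnnaf1 lnta if miss==0, ll(0) ul(5)
    tobit lnnaf1 lnta if miss==0, ll ul
   (If the censoring limits are the sample minimum and maximum, they need not be listed; STATA selects them automatically.)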
    tobit lnnaf lnta if miss==0, ll ul(5) robust cluster(companyid)
   (robust and cluster() control for heteroscedasticity and time-series dependence.)

3.8 Duration models (survival models)

1. Applicable data: the dependent variable measures how long an event lasts. For example:
   Duration of life (medical, engineering)
     • how long do people live for?
     • how long do machines last?
   Duration of unemployment (economics)
     • how long do people remain unemployed?
     • for example, we may be interested in how retraining schemes affect the duration of unemployment
   Duration of CEO tenure (management)
     • how long does the CEO stay at the same company?
   Duration of auditor-company tenure (accounting)
     • how long do the company and audit firm stay together?

2. Quantity modeled: the "hazard rate", h(t), is the probability that the event will occur in period t, given that it has not occurred up to time t.

3. Declaring survival data: stset timevar
    use "J:\phd\kva.dta", clear
    list
    stset failtime
   This command creates four internal variables. To display them:
    list failtime _st _d _t _t0
   The _st variable is a dummy equal to one for observations whose data have been stset (e.g., there would have been some zero values if we had excluded some observations using the if qualifier).
   The _d variable indicates whether the state changed (whether failure occurred).
   The _t variable is the survival time.
   The _t0 variable is the starting point of the survival time; the default is 0.

4. Estimating with the Cox proportional hazards model: stcox
    stcox load bearings
   (load and bearings are two factors that affect lifetime.)
   The reported hazard ratios are the exponentials of the coefficients.
   The hazard ratio for load = 1.52647 = exp(a1), where a1 is the coefficient on load, so a1 = ln(1.52647) = 0.4229578.
   The coefficient on bearings = ln(0.0636433) = -2.754461.
   The load coefficient is significantly positive, implying that the machines fail more quickly (higher hazard rate) when they are under greater stress.
   The bearings coefficient is significantly negative, implying that the machines fail less quickly (lower hazard rate) when they use the new type of bearing.
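The link between the reported hazard ratios and the coefficients can be written out explicitly. The following is only a sketch that assumes the standard Cox proportional hazards form; the baseline hazard h_0(t) and the symbol a_2 for the bearings coefficient are notation introduced here for illustration and do not appear in the notes above.

```latex
% Assumed standard Cox proportional hazards form for the two covariates
\begin{align*}
h(t \mid \text{load}, \text{bearings})
  &= h_0(t)\,\exp\!\left(a_1\,\text{load} + a_2\,\text{bearings}\right) \\[4pt]
% Hazard ratio for a one-unit increase in load, holding bearings fixed
\frac{h(t \mid \text{load}+1, \text{bearings})}{h(t \mid \text{load}, \text{bearings})}
  &= \exp(a_1) = 1.52647
  \;\Longrightarrow\; a_1 = \ln(1.52647) = 0.4229578 \\[4pt]
% Hazard ratio for the new-type bearing versus the old type, holding load fixed
\exp(a_2) &= 0.0636433
  \;\Longrightarrow\; a_2 = \ln(0.0636433) = -2.754461
\end{align*}
```

Read this way, a one-unit increase in load multiplies the hazard by about 1.53 (roughly 53% higher), while the new-type bearing multiplies it by about 0.064 (roughly 94% lower).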
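   To have STATA report coefficients rather than hazard ratios, use:
    stcox load bearings, nohr

5. First method for handling ties: breslow
   A tie occurs when observations have the same survival time. The Breslow method is very fast and is the default method that STATA uses for resolving ties.
   Commands:
    stcox load bearings, breslow
    stcox load bearings, efron

6. Second method for handling ties: efron

7. The Efron method is more accurate than the Breslow method but takes longer. For two identical failure times, it assigns each a probability of 0.5.

8. When censoring is present, that is, when not every observation in the sample has failed, add the failure() option to the stset command:
    stset failtime, failure(failed)
   The failtime variable gives the time of failure or censoring.
   The failed variable indicates whether failure or censoring occurred.
   STATA assumes censoring if failed equals zero or is set to missing.

9. The cases above all have one row per subject. When some characteristic of the subject changes over time, several rows are needed to describe it. In that case, add the identifier option id() to the command that declares the data as survival data:
    stset t, id(patid) failure(died)

10. When left-censoring occurs, add the start-time variable to the stset command:
    stset end, id(id) failure(died) enter(begin)

11. Handling data that are missing for part of the time span: specify the failure time, the subject identifier, the failure indicator, and the start time.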
    stset end, id(id) failure(died) enter(begin)

12. To control for heteroscedasticity and time-series dependence, add robust and cluster() at the end of the estimation command:
    stcox x1, robust cluster(id)

Summary: choose the regression model according to the type of the dependent variable

Dependent variable (Y)    Examples    Estimation meth