An Analysis and Outlook of the Computer Employment Market in 2010
www.177liuxue.cn | Source: original content | Published: 2010-10-04 18:52:54
At the start of the new year, let me first wish everyone a happy new year, good health, and success in all you pursue. The solar new year has passed and the lunar new year approaches; as we see off the Year of the Ox and welcome the Year of the Tiger, let us look ahead at the prospects and trends of the 2010 IT job market. In the Year of the Ox, neither the stock market nor the IT market managed to charge ahead; will the IT market roar back in the Year of the Tiger? Under the shadow of the 2009 economic crisis, finance and manufacturing took a severe beating, and the IT industry slumped badly with them. So what trends can we expect in the 2010 IT job market? Which IT skills will be the darlings of employers? And which IT certificates can serve as golden keys in the job hunt?
I imagine those who are job hunting or preparing to change jobs would very much like answers to these questions. I hope this article offers everyone some useful points of reference.
Let us first look at the job market. Although the U.S. IT job market is still contracting, the Canadian IT market has already begun to recover. Judging by posting volumes on the major Canadian job-search sites, IT-related postings have led all industries since the 2010 new year, at roughly 25% of the total. The overall trend is that demand for IT professionals in 2010 is definitely growing, and the growth will slowly widen; if the Canadian economy stays as stable as it is now, the IT job market should see a marked jump around October. On the whole, then, our outlook for the 2010 Canadian IT job market is fairly optimistic.

Now let us look at which types of IT work are in demand. According to internal reports from several U.S. and Canadian market research firms, demand is concentrated in the following areas: project management (PM), business analysis (BA), enterprise architecture (EA), business intelligence (BI), enterprise security, enterprise risk management and audit, enterprise resource planning (ERP), and finally software and web development. The specific drivers are these:

1. With the market soft and both returns on investment and sales falling, large and mid-sized enterprises are running visibly fewer development projects on smaller budgets while demanding higher project quality. Under this pressure, every company needs experienced project managers and assistants to run its projects, which creates strong market demand.

2. In today's fiercely competitive economy, mergers, partnerships, and outsourcing are an inevitable trend, which calls for more senior talent in project management, business analysis, enterprise architecture, and enterprise risk and security. For example, Allstream sold its non-telecom IT consulting business and staff to PricewaterhouseCoopers (PwC), and PwC has decided to strengthen its project management and business analysis capabilities and add headcount.

3. When the market was booming, companies focused all their attention on grabbing market share and neglected infrastructure. In a soft market they discover that a solid, modern, flexible, and manageable platform and business processes are the key to retaining old customers and attracting new ones. Many companies are therefore rebuilding their IT infrastructure platforms and business processes now, to be more competitive when the market recovers; business analysis, business intelligence, and ERP thus become the focus of the rebuilding. In the early phases of such projects, business analysis, enterprise architecture, and enterprise risk and security management are the main work.

In sum, talent in the areas above will be what employers fight over in 2010. As for development and quality control (QC/QA), opportunities will exist throughout 2010, but in limited numbers. We can predict, however, that if the projects above are approved and their analysis and design succeed, large numbers of developers and QC staff will be needed in the projects' middle and late phases. In other words, development and QC jobs should appear in volume in the fourth quarter of 2010 and in 2011.
Having analyzed the trends in the job market, let us now look at demand for specific technologies and skills.
On the development side, Java/J2EE, .NET, C#, PHP, Ruby (and Ruby on Rails), and Python account for most of the demand. For systems integration and the middle tier, WebSphere, BizTalk, and SharePoint remain mainstream. In systems administration, Unix, Red Hat Linux, and Windows Server are perennial evergreens. In database and data warehouse administration, Oracle, Microsoft SQL Server, and Informatica hold most of the market. In business intelligence and data analysis, SAS is still the leader, though the bar for data analysis skills keeps rising. In networking, VoIP, storage, virtualization, cloud, SaaS, and network administration are the hottest areas.
Next, how can you get a head start in this job market? Technology and skills certificates are indispensable. Which certificates carry the most value, and which are most recognized by employers? The IT market analysis firm Foote Partners published its ten hottest IT certifications for 2010 (based on the U.S. market): Red Hat Certified Engineer, Red Hat Certified Technician, Cisco IP Contact Center Express Specialist, Cisco Certified Design Expert, Check Point Certified Security Administrator, GIAC Certified Incident Handler, GIAC Security Audit Essentials, Systems Security Certified Practitioner, SAS Certified Advanced Programmer, and Sun Certified Programmer for Java. PMP and ITIL are the hot certificates for management positions. In Canada, demand differs slightly: Canadian employers pay more attention to development and management credentials, so certificates such as Java, J2EE, .NET, Oracle, WebSphere, CCNP, CCNA, SAP, PMP, CIA, CISA, BA, and ITIL are well recognized and welcomed. (This article was edited by 177liuxue and reprinted from http://www.177liuxue.cn; please retain this attribution when reposting.)
Now let us consider the position of Chinese immigrants in the IT job market and the areas that suit them. Chinese immigrants in Canada work mainly in technical roles, largely because of two fatal weaknesses: weak English communication skills and a lack of North American work experience. Those who have been in Canada five years or more and communicate well in English can now consider moving toward senior technical and management positions. Those who have just landed, or arrived only recently, should weigh their study and career direction carefully, and not grasp at whatever cure is on offer. In today's market there is no technology you can learn, and no certificate you can pass, that instantly lands you a job. SAP, for example, has been hot lately, but because its language requirements are high, many Chinese immigrants have spent a great deal of money studying it and still cannot find related work. If your English is not very strong, do not overreach and chase fads; otherwise you will pay heavy tuition, raise your hopes, and end up deeply disappointed. Do not blindly chase high salaries; build a solid foundation first, then look for opportunities to grow. Start with a market demand analysis; then, based on your own interests and abilities, and with proper training and guidance, draw up a realistic career plan. Add personal effort, and I am confident everyone can find a reasonably satisfying job. I wish you all new heights in career and family in 2010, and success in everything!
Author: Jet Chen
More than 12 years of experience in Canada in IT development, departmental management, and project management; formerly with Canada's largest securities administration firm, Canada's largest insurance company, and the Ontario government; has also taken part in planning and managing outsourcing projects in China; deeply versed in the Canadian IT industry and its job market.
BI, the Next Wave After ERP: When SAP Meets SAS
In the market for enterprise management software, SAP is the undisputed leader: from 2001 through 2004 it held first place in the global Top 100 of enterprise management software vendors. In the 2004 Top 100, however, a specialist business intelligence vendor, SAS, suddenly leapt to fifth place; counting Oracle's recent acquisition of PeopleSoft, SAS becomes the fourth without question. Other specialist BI vendors also feature prominently in the Top 100: Cognos and Hyperion both around No. 18, MicroStrategy at No. 47, Business Objects at No. 51, and so on. This shows that business intelligence (BI), following ERP, CRM, and SCM, is gradually winning recognition across industries, and leading international companies such as IBM, Microsoft, and Oracle have all launched BI businesses, fueling a thriving BI market. SAS's leap, from hovering around 21st place in 2001 to fifth in so short a time, raises the question: does it foreshadow an eruption of the BI market?
SAS: From the U.S. to Asia-Pacific
SAS is the market leader in providing next-generation business intelligence software and services for building true enterprise intelligence. SAS solutions have more than 40,000 customers, including 96% of the top 100 companies of the 2003 Fortune 500, which shows SAS's commanding lead in the high end of the BI market. One example suffices: in the famously exacting U.S. FDA new-drug approval process, the article notes, the statistical analysis of new-drug trial results may only be performed with SAS, and results computed with any other software are not accepted.
IDC forecasts that the Asia-Pacific business intelligence software market will grow 23% a year, reaching US$3.3 billion in 2006, nearly three times its current value of US$1.2 billion, and China is currently among the fastest-growing BI markets in Asia-Pacific. SAS recently announced that its 2004 global revenue rose 15% over 2003, with Asia-Pacific revenue up 20%, reflecting its growing share of the global BI market and a particularly strong showing in Asia-Pacific, where its growth is still gathering pace. SAS's major customers in Asia-Pacific and Greater China in 2004 included Aeon Credit Service, the Australian Taxation Office, the Hong Kong SAR Water Supplies Department, Shanghai General Motors, and the Shanghai Stock Exchange.
SAS's Journey into China
SAS has long paid close attention to the Chinese business intelligence market. It set up a presence in China as early as 1990. In 1997, SAS Institute formally announced its Greater China region, and in March 1999 it established a wholly owned subsidiary, SAS Software (Shanghai) Co., Ltd., along with a Beijing office. In November 2003, SAS CEO Jim Goodnight visited China and lectured on BI at Peking University; SAS hopes to bring leading intelligence solutions to Chinese users and help Chinese industry move from information automation to information intelligence. At the end of 2004, at the Better Management LIVE 2004 executive summit, SAS released its financial intelligence and corporate performance management solutions. With the SAS financial intelligence solution, finance departments can map out the full picture of corporate performance management and turn themselves into trusted advisers delivering high-value information across the enterprise. The solution combines SAS's newest financial management software, with comprehensive financial consolidation, reporting, planning, and analysis, and SAS strategic performance management software that helps decision-makers plan, execute, and adjust business strategy.
SAS's China Strategy
Having been through the baptism of ERP, CRM, and SCM, many Chinese enterprises have accumulated vast amounts of data, and extracting intelligence from that sea of data to support decision-making has become their most pressing need. Business intelligence, as a rational approach to management decision-making, is being accepted by more and more Chinese enterprise users. SAS will further increase its investment in China, drawing on the experience of more than 40,000 BI projects worldwide and 27 years of industry knowledge to bring advanced technology and mature industry solutions to Chinese customers and help more Chinese enterprises become intelligence-driven. Beyond implementing generic management software, an enterprise needs a deeper grasp of the business knowledge and skills peculiar to its industry; it is this knowledge that lets a company excel, stand out in its field, and sharpen its competitiveness. Adapting to the times, SAS has arrived in China with its leading BI technology, and Chinese enterprises in turn will give more weight to BI applications to improve their decision-making.
Facing such enormous market potential, SAS will redraw its strategic deployment in the Chinese market in 2005, further increasing investment in technology R&D and integrating the resources of its mainland China, Hong Kong, and Taiwan operations to prepare for continued success in the China market.
Digital China Says SAS Is Too High-End
Since 1999, Digital China has been building a digital nervous system marked by three systems: DSS (decision support), ERP, and e-Bridge, pushing internal informatization and networking to cut costs and raise efficiency. Because data volumes were modest at the time and a data warehouse was not yet warranted, the DSS served its needs well. But as the group's data multiplied, the DSS buckled under the load and the need for BI grew urgent, prompting extensive research and preparation. In Digital China's BI tender, SAS, IBM, and Sagent were the leading candidates, and Digital China ultimately chose Sagent, reasoning that SAS was too high-end while IBM was pricier and less compatible with the database underlying its SAP R/3 system. Sagent won the order with fast processing, good compatibility with SAP R/3, and a suitable price. After thorough preparation, Digital China's BI system went live successfully in March 2002.
Finance and Telecom Say SAS Excels
SAS Institute has partnered with Thomson Prometric, the most authoritative international testing and assessment organization, to provide a comprehensive commercial software certification platform for finance, telecommunications, transportation, manufacturing, government, and research and education worldwide. The SAS System has performed outstandingly at the People's Bank of China, ICBC, China Construction Bank, Agricultural Bank of China, China Development Bank, Guangdong Development Bank, the China Securities Regulatory Commission, the Shanghai Stock Exchange, Bank of Communications (Shanghai), China Life Insurance Group, Shanghai Baoshan Iron & Steel Group, Beijing Mobile, Hebei Mobile, the National Bureau of Statistics, the Ministry of Railways, the General Administration of Customs, the Chinese Center for Disease Control and Prevention, Shanghai Unicom, Jilin Telecom, the information center of the Civil Aviation Administration of China, China Southern Airlines, and others.
Moreover, one of IDC's ten predicted trends for the 2005 Asia-Pacific (excluding Japan) software market is the continued convergence of the business intelligence (BI) and enterprise applications (EA) markets. IDC also expects vendors to introduce low-end BI solutions in 2005 to meet the needs of the Asia-Pacific (excluding Japan) market. SAS, then, really will have to localize its high-end products to meet the current needs of Chinese enterprises.
As Chinese enterprises move from information automation to information intelligence, SAS will undoubtedly stir up a new BI wave in China.
SAP, founded in 1972 and headquartered in Walldorf, Germany, is the world's largest supplier of enterprise management software and collaborative commerce solutions and the third-largest independent software vendor overall. More than 19,300 customers in over 120 countries run more than 60,100 installations of SAP software, and over 80% of the Fortune 500 profit from SAP management solutions. SAP has subsidiaries in more than 50 countries and is listed on several stock exchanges, including Frankfurt and the NYSE.
SAP China
SAP worked with Chinese state-owned enterprises as early as the 1980s and gained successful experience there. At the end of 1994 it opened a representative office in Beijing; SAP China was formally established in 1995, followed by Shanghai and Guangzhou branches in 1996 and 1997. As the absolute leader of China's ERP market, SAP holds a 30% market share, with annual results growing at more than 50%.
SAP also has numerous partners in China, including IBM, HP, Sun, Accenture, BearingPoint, Deloitte, Capgemini Ernst & Young, the Ouya Alliance, Hansi, Neusoft, Gaowei Xincheng, Lenovo Hanpu, and Digital China. SAP works closely with these partners on numerous projects, turning advanced management ideas into reality.
SAS's mission is to deliver superior software and services that give users the power to make the right decisions. We hope to provide the most competitive weapon for your business decisions.
Founded in 1976, SAS is the world's largest privately held software company, with nearly 10,000 employees in close to 200 offices worldwide.
In delivering a new generation of business intelligence software and creating true enterprise intelligence, SAS is unquestionably the market leader. More than 42,000 organizations worldwide, including 90% of the Fortune 500, use SAS business intelligence solutions, chiefly to build win-win relationships with customers and suppliers, to make sound decisions quickly, and to move their enterprises or organizations forward. SAS is the vendor that fully integrates world-leading data warehousing, data analysis, and traditional BI applications to create intelligence from vast amounts of data.
In 2003, SAS revenue was US$1.34 billion, continuing its unbroken annual growth in revenue and profit. To support new technology development, SAS reinvests 26% of revenue in R&D, about double the average for large software companies.
This article comes from a CSDN blog; please credit the source when reprinting: http://blog.csdn.net/AmiRural/archive/2007/01/18/1486728.aspx
SAS Job Prospects and Certification Exam Essentials
Posted Tuesday, May 8, 2007
Contributed by New Concept
We live in an age of information explosion, where being first to obtain or locate the most valuable information and resources has become a key factor in winning fierce competition. Hence business intelligence (BI) was born, and related technologies and tools such as data warehousing, data mining, and SAS have developed at an astonishing pace, growing ever hotter in North America and worldwide. This naturally demands large numbers of technical staff in the field, and because the work is tied to databases, the positions are relatively stable and well paid, a good fit for Chinese technical immigrants. Moreover, since BI work is concentrated in large banks and enterprises, few people can really do it, and it revolves around data, this line of data-processing work has stood out even as competition in the North American job market grows fiercer by the day.
For nearly 30 years, SAS has been recognized as the first choice in analytical software with an edge across industries. As a leading global supplier of business intelligence (BI) software, SAS has always been devoted to turning raw data into knowledge and insight; its BI software helps customers extract intelligence from enormous volumes of data, and it bills itself as the only end-to-end vendor that integrates leading data warehouse technology, analytic methodology, and traditional business intelligence, hence its reputation as "the management master behind the world's top 500." SAS CEO Jim Goodnight has said: "In this fast-changing era, success or failure hinges on how quickly you turn information into knowledge and make decisions on that knowledge." Another senior executive, Cooke, has said: "Business intelligence and analytics help uncover new opportunities; companies of any size can use BI and analytic tools effectively to strengthen their existing resources - people, technology, data - and take a commanding role on the global business stage." SAS marketing executive Jim Davis stresses: "Time is the level playing field: every enterprise, whatever its size, industry, or location, has exactly 1,440 minutes a day. Those that can not only get information quickly but also have the time to analyze the situation thoroughly before deciding will gain the competitive advantage."

Clearly, in a time when calls for business intelligence have never been louder, only solutions that create real value for customers win recognition. BI software differs from ERP: it is not process-management software. Surveys and forecasts put BI market growth at an average of 27% a year, with the Asia-Pacific BI software market growing 23% a year - numbers that are a shot in the arm for the many BI vendors.
What is business intelligence (BI)? BI is the collection of concepts, methods, processes, and software that helps enterprises improve their decision-making and operations: by collecting, storing, mining, and analyzing data, it gives decision-makers a factual basis for their choices, with the main goal of turning the information an enterprise holds into competitive advantage. Put bluntly, BI helps you extract useful information from business data and then act intelligently on it, instead of deciding by gut feel. Consider a classic case: a Wal-Mart store manager in the U.S. noticed that, for some time, in-store sales of beer and diapers had both climbed on weekends. Why would two seemingly unrelated products show such similar sales swings; was there a connection? Analysis with SAS later revealed that the buyers of both products were almost all men aged 25 to 35 with infants at home, always shopping on weekends. It further emerged that these men liked to watch the game in the evening with a beer, and for the child they had to mind, disposable diapers were the convenient choice. Wal-Mart therefore decided to display the two products together, and sales rose markedly.
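To make the "support" and "confidence" behind this kind of basket analysis concrete, here is a minimal Python sketch; it is my own illustration, not SAS code and not Wal-Mart's actual data:

```python
# Association-rule metrics for a rule like "diapers -> beer".
# The transactions are invented sample data.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # 1.0
```

A real basket-analysis tool searches over all frequent itemsets (e.g. with the Apriori algorithm) rather than scoring one hand-picked rule, but the two metrics it reports are exactly these.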
Business intelligence can contribute in several areas:

First, understanding the business. BI is a driving force for understanding the business: recognizing trends, anomalous patterns, and features in the data, and analyzing their impact on the business.

Second, customer segmentation and profiling. Using customers' accumulated purchase records and profile data, segment the customer base and analyze each segment's purchasing power, habits, cycles, needs, and creditworthiness; determine which customers bring the enterprise the most profit and which bring the least while demanding the most in return; then tailor service and incentives to each segment.

Third, improving relationships. BI can give customers, employees, suppliers, shareholders, and the public useful information about the enterprise and its business, raising the company's profile and making the whole information chain more consistent. With BI, an enterprise can identify and resolve problems quickly, before they become crises. BI also strengthens customer loyalty: an engaged, well-informed customer is more likely to buy your products and services.

Fourth, analyzing marketing strategy. Data warehouse technology lets a marketing strategy be simulated on a model; the simulation shows whether the strategy is sound, so the enterprise can adjust and optimize it for maximum success.

Fifth, analyzing operating costs and revenue. Cost out each type of business activity, compare prospective revenue against the various expenses, analyze the curves of the business, and derive improvements that cut costs, reduce outlays, and raise income.

Sixth, analyzing and preventing fraud. With data mining, once the patterns behind the various kinds of fraud and payment delinquency are distilled, timely alerts can be raised to minimize the enterprise's losses.

Sitting at the top of the three tiers of informatization, BI is destined to live at the high end. In the flow from data to information, information to knowledge, knowledge to decisions, and decisions to wealth, BI promises that vendors, enterprises, and you, about to embark on the SAS voyage, will all be smiling brightly in front of the money.
SAS's stated goal is to keep consolidating its market leadership with sustained double-digit growth and to build the industry's best brand in market share, core technology, and business solutions. One may therefore trust that SAS's "intuition," grounded in scientific data analysis and methodology, can help businesses come out ahead of fierce competition. SAS is already widely used in finance, insurance, pharmaceuticals, public health, epidemic prevention, telecommunications, transportation, customs, government, universities and research institutes, market research, agriculture, manufacturing, and more; it is also, the article claims, the only commercial software prescribed in the pharmaceutical industry for the statistical analysis used in developing and evaluating drugs. So what is SAS (Statistical Analysis System)? The SAS System, developed by SAS Institute, is an integrated software system for data warehousing, large-scale data processing, data mining, statistical analysis, charting, and web connectivity. SAS Institute Inc. was founded in 1976 and is headquartered in North Carolina. Its aim is to give users the most competitive weapons, letting them make the most correct and effective decisions for their business: turning chaotic raw data into valuable information and knowledge so that a business always holds the right course and speed amid the waves of competition. The SAS language itself is a non-procedural, fourth-generation language, similar in feel to C, that combines the capabilities of various high-level languages with flexible formats, fusing data processing and statistical analysis into one. Version 8 is implemented with C++ support, and the new Version 9 will have Java-supported editions, covering the whole pipeline of reading, processing, analyzing, presenting, and connecting to data over the network.
Is SAS hard to learn? In my experience it is easy to get started: anyone who can think logically can program in it. SAS has its own distinctive programming steps and language, and its greatest virtues are simplicity and focus. Relying on just the DATA step and the PROC step, flexible combinations of statements and steps can solve any problem, simple or complex, in reading, processing, analyzing, presenting, and linking data. The SAS programming language is widely admired and, unlike some computer languages, shows no sign of going out of date or being displaced. SAS is strong on two fronts: analysis and data handling. And SAS keeps moving toward being a "synthesizer," a language fit for novices and experts alike. So with the right learning environment and effective guidance, solid SAS programming skills are within reach; but reaching the stage of getting twice the result for half the effort takes BASE & ADVANCED certificate training plus project practice. Experience bears this out: nearly 100% of our students pass the certification exams, and some score above 90. Through project training, students make real strides in SAS skills, professional resumes, and interviewing, and their success in landing jobs keeps spurring us to serve SAS job-seekers to ever better, higher standards.
SAS currently offers five professional certifications: SAS Certified Base Programmer, SAS Certified Advanced Programmer, SAS Certified WebAF Developer: Server-side Credential, SAS Certified Warehouse Development Specialist Credential, and SAS Certified Warehouse Architect Credential.
The five SAS global certifications form a progressive ladder: only by passing one level's exam do you qualify to sit the next. For example, to take SAS Certified Advanced Programmer you must first hold the SAS Certified Base Programmer credential. As for validity: at present the five certifications have no fixed expiry date, though certificates that are very old, or tied to very old versions, lose some of their value. So far only some 3,000 people worldwide have passed SAS global certification. In the developed countries of Europe and North America, certified SAS professionals with rich experience are in short supply in the job market, hence the saying popular there: "If you have a SAS certification, you will never lose your job." Whoever seizes the chance to earn SAS certification early and build up rich SAS experience will stand out and prevail in the job-market battles ahead.
Saturday 5/12, 4:30-6:00 pm: SAS job prospects and certification exam essentials.
Scarborough campus: 2175 Sheppard Ave East, Suite 108 (Sheppard/Victoria Park, M2J 1W8); near the highway, the subway, and the TTC, with plenty of free parking! Code: #0049
Data Mining: Prospects and Current State
2008-08-14 16:13 | Source: Success Career Guidance Center
Career introduction. Data mining is the set of methods and techniques for discovering latent regularities and extracting useful knowledge from large volumes of data. Because of its close ties to databases, it is also called knowledge discovery in databases (KDD): applying advanced intelligent computing to masses of data so that, with or without human guidance, the computer uncovers latent, useful patterns (also called knowledge). Broadly speaking, any process that digs information out of a database counts as data mining, and in that sense data mining simply is BI (business intelligence). In stricter technical usage, data mining refers to this: source data are cleaned and transformed into a data set suitable for mining; the mining distills knowledge from that fixed-form data set; and the knowledge is finally expressed in a suitable form for further analysis and decision-making. On this narrower view, we can define data mining as the process of distilling knowledge from a data set of a particular form. Mining typically targets specific data and specific problems, selects one or more mining algorithms, and finds the regularities hidden beneath the data, which are then used for prediction and decision support.

The main functions of data mining:
1. Classification: describe objects by building classes according to their attributes and features. For example, a bank that has classified its customers on the basis of past data can use those classes to assess new loan applicants and assign each an appropriate lending plan (see the sketch after this list).
2. Clustering: identify the internal rules of the objects under analysis and group them into classes by those rules, for example dividing applicants into high-, medium-, and low-risk groups.
3. Discovery of association rules and sequential patterns: an association is a link whereby when one thing happens, another tends to happen. For example, people who buy beer every day may also buy cigarettes; how strong that tendency is can be described by the rule's support and confidence. A sequence, unlike an association, is a link over time, for example, the bank adjusts interest rates today and the stock market moves tomorrow.
4. Prediction: grasp the regularities in how the objects develop and foresee future trends, for example judging future economic growth.
5. Deviation detection: describe the few extreme special cases among the objects and reveal the reasons behind them. For example, if 500 of a bank's 1,000,000 transactions involve fraud, the bank must uncover the factors behind those 500 cases to reduce its future operating risk.
Note that these functions do not exist in isolation; in data mining they interconnect and work together.
Methods and tools of data mining. As a young technology for handling data, data mining has many new characteristics. First, it faces massive data, which is why it arose in the first place. Second, the data may be incomplete, noisy, and random, with complex structure and high dimensionality. Finally, data mining is a crossroads of disciplines, drawing on techniques from statistics, computer science, mathematics, and more. The most common and widely applied algorithms and models are: (1) traditional statistical methods: (a) sampling - faced with huge volumes of data, analyzing everything is neither possible nor necessary, so sample sensibly under theoretical guidance; (b) multivariate analysis - factor analysis, cluster analysis, and the like; (c) statistical forecasting - regression analysis, time-series analysis, and so on. (2) Visualization: present data features intuitively with charts such as histograms, using many techniques of descriptive statistics; a hard problem here is visualizing high-dimensional data.

Professional requirements. Basic requirements a data miner needs in order to carry out the tasks in a mining project:
1. Technical skills: a master's degree or above in data mining, statistics, or a database-related field; fluent command of relational database technology, with database system development experience; fluency in the common data mining algorithms; a grounding in mathematical statistics and familiarity with the common statistical tools.
2. Industry knowledge: knowledge of the relevant industry, or the ability to acquire it quickly.
3. Team spirit: good teamwork and active, close cooperation with the other project members.
4. Client-facing skills: good client communication, the ability to articulate a mining project's crux and difficulties, and skill at tempering clients' misconceptions and inflated expectations of data mining; good knowledge transfer, bringing model maintainers up to speed on mining methodology and model building as quickly as possible.
Advanced requirements, which raise project efficiency and shorten the schedule: experience implementing data warehouse projects and familiarity with warehouse technology and methodology; fluent SQL, including complex queries and performance tuning; fluency with ETL tools and techniques; fluency with Microsoft Office, including the statistical charting in Excel and PowerPoint; skill at tying mining results to the client's business and turning them into valuable, actionable proposals.

Application and employment areas. Data mining is currently applied mainly in telecoms (customer analysis), retail (sales forecasting), agriculture (industry data forecasting), web logs (page personalization), banking (customer fraud), power (customer calls), biology (genomics), astronomy (star classification), chemicals, pharmaceuticals, and so on. The typical problems it now solves include database marketing, customer segmentation and classification, profile analysis, and cross-selling, as well as churn analysis, credit scoring, and fraud detection, with successful applications in many fields. Visit the famous online bookstore Amazon (www.amazon.com) and select a book, and you will see recommendations, "Customers who bought this book also bought"; behind them, data mining is at work. The objects of data mining are the data accumulated in some professional field; the mining process is interactive and iterative; and the results must be applied back to that field, so the whole process depends on domain expertise. "Business first, technique second" is the character of data mining. Learning it therefore does not mean discarding your existing expertise and experience; on the contrary, a background in another industry is a major advantage for a data miner. Those with experience in sales, finance, machinery, manufacturing, call centers, and the like can, by learning data mining, lift their professional level and shift from transactional roles to analytical ones without leaving their field. From its debut in the late 1980s to wide application in the late 1990s, business intelligence with data mining at its core has become a new favorite in IT and other industries.

Data collection and analysis specialist. Role: collect the company's operating data and mine it for regularities that guide the company's strategic direction. The position is often overlooked yet genuinely important. Because database technology first appeared in computing, and computer databases offer huge storage, fast lookup, and semi-automated analysis, the role emerged first in the computer industry and later spread to other industries as computing became ubiquitous. It generally goes to someone who knows database applications and has some capacity for statistical analysis; statisticians with computing skills, or computer specialists who have studied data mining, can fill it, ideally with some knowledge of their industry's market. Job-hunting advice: because many companies chase short-term profit and neglect long-term strategy, many domestic firms do not yet value this position, but large companies and foreign firms value it highly, and its standing will rise with time. The role also builds industry experience easily: in the course of analysis you readily grasp the industry's market conditions, customer habits, and channel structure, so if you mean to start a business in some industry, beginning as a data collection and analysis specialist is a sound choice.

Market/data analyst.
1. Market data analysis is an essential link in modern, scientific marketing. The industry employing the most marketing/data analysts is direct marketing, which since the 1990s has increasingly become companies' main way of promoting products. According to the Canadian Marketing Association, direct marketing created 470,000 jobs in 1999 alone, and from 1999 to 2000 added another 30,000 positions. Why does direct marketing need so many analysts? As business competition intensifies, companies want the greatest sales return on their advertising, with more users responding to their ads, so before placing ads they must do extensive market analysis. For example, combining their product with the target market's household income, education, and spending patterns, they work out which neighborhoods or residents are most likely to respond to the company's advertising, buy the product, or become customers, and then aim the advertising only at those groups. Screening ad placements this precisely both saves money and raises the return on sales. All of this analysis rests on databases, through data processing, mining, and modeling, and the market analyst's work is indispensable to it.
2. Broad industry applicability: almost every industry uses data, so a data/market analyst can work not only in the IT industry traditional among Chinese immigrants but also in government, banking, retail, pharmaceuticals, manufacturing, transportation, and more.

Current state and outlook. Data mining is a new discipline born of the information society's need to extract information from massive databases; it crosses statistics, machine learning, databases, pattern recognition, artificial intelligence, and other fields. Key Chinese universities have opened data mining courses or research programs, notably the Institute of Computing Technology of the Chinese Academy of Sciences, Fudan University, and Tsinghua University, and government bodies and large enterprises have also begun to take the field seriously. An IDC survey of 62 European and North American enterprises that adopted business intelligence found a three-year average return on investment of 401%, with 25% of the enterprises exceeding 600%. The survey also showed that for an enterprise to succeed in a complex environment, senior managers must control an extremely complex business structure, which is hard to do without solid facts and data. So as data mining technology keeps improving and maturing, more users will adopt it and more managers will gain business intelligence. IDC (International Data Corporation) estimated the BI market at roughly US$14 billion in 2004. With China's entry into the WTO, fields such as finance and insurance are gradually opening up, which means many enterprises will face intense competition from large multinationals. Enterprises in developed countries have adopted business intelligence at levels far beyond ours: a 1999 survey by the U.S. Palo Alto Management Group of 375 large and mid-sized enterprises in Europe, North America, and Japan found BI adoption at or near 70% in finance and 50% in marketing, with adoption in every application area expected to rise by roughly 50% over the following three years. Many enterprises now treat data as precious wealth and use business intelligence to uncover the information hidden in it, reaping huge returns. There is as yet no official domestic market report on the data mining industry itself, but domestic research on data mining is under way across industries, and foreign experts predict that within five to ten years, as data accumulates and computing spreads, data mining will form an industry in China. Competition in the IT job market is already fierce, and data mining, the core technology of data processing, is receiving unprecedented attention. Data mining and business intelligence sit at the very top of the enterprise IT-business pyramid; the domestic training pipeline for the specialty is not yet sound, and the supply of people fluent in mining technology and BI is tiny, while the latent demand from enterprises, government bodies, and research institutes is huge, leaving an enormous gap between supply and demand. Combine data mining with the professional knowledge you already have, and you are bound to open a new chapter in your career!

Salary. As with most IT positions, demand for data warehousing and data mining talent in China is saturated at the low end and scarce at the high end, where senior warehouse and mining people are especially rare. A senior professional needs familiarity with several industries, at least three years of experience on large DWH and BI projects, fluent English reading and writing, and the ability to drive projects; such people can earn annual salaries of 200,000 yuan or more.

Certification.
1. SAS certification, its industries, and its prospects: SAS global certification is the internationally recognized, authoritative credential in data mining and business intelligence; as China's IT environment and applications mature, both fields have great room to grow. Earning the certification lays a good foundation for accumulating rich experience in data mining and analytic methodology and helps open new paths for career development.
2. Validity: at present the five SAS certifications have no fixed expiry, but certificates that are very old, or tied to very old versions, lose some of their value.
3. The five levels are progressive: only by passing one level's exam can you take the next.
4. Format: a two-hour computer-based exam of 70 objective questions.

Related link. With the rapid overall growth of China's logistics industry, logistics informatization has made real progress. Whether in IT hardware, software, or information services, the logistics industry now invests at a certain scale, with total investment of 2-3 billion yuan in each of the past two years. Government support for modern logistics and sharpening competition in the logistics market have steadily advanced informatization. Analysys International's report "China Logistics Industry Informatization Annual Report 2006" notes that Chinese logistics is shifting wholesale from a traditional to a modern model, that the modern model will steer the industry's informatization needs, and that the basic driving force of the shift is market demand. The report's figures: from 2006 to 2010, traditional logistics enterprises will invest a cumulative 10+ billion yuan in IT, and third-party logistics enterprises more than 2 billion yuan. Because operational application software currently makes heavy demands on terminal hardware, while software-hardware integration is generally poor and narrowly one-to-one, enterprises will demand better integration of software and hardware. Logistics software development will draw more on operations research and data mining, and specialist service providers will be better placed to help solve the development problems. Logistics science is theoretically grounded in operations research and puts great weight on finding relationships within tangled data (around a cost-versus-service-level framework), so data mining matters all the more to the related software.
The Future of Data Mining, Seen Through IBM's Acquisition of SPSS
Editor: Xiaoxiong | By IT168, compiled/translated by Huang Yongbing | 2009-08-14
[IT168 Technology Commentary] IBM recently announced the acquisition of the statistical analysis specialist SPSS. The move is not just about rounding out IBM's statistical analysis product line; it is a bet on the data mining market, which happens to be my focus.
[Figure: data mining tools become the focus]
There are two angles to consider: first, giving users genuinely usable, effective data mining tools; second, mining inside the database. Let me take the second first.
Traditionally, to run data mining you extracted data from the data warehouse or data mart into the mining tool and processed it there. This has an obvious performance problem: extracting all the data is no easy task. You can soften the blow by sampling, but accuracy drops with it. The recent trend is to perform the mining directly inside the database. A bold prediction: IBM's first announcement after closing the SPSS deal will be that DB2 can run SPSS statistical functions.
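To illustrate the extract-then-mine versus in-database contrast in the simplest possible terms, here is a hedged Python sketch using SQLite; it is a generic illustration of the idea, not DB2's or SPSS's actual interface:

```python
# Contrast: extract everything and compute client-side, vs. pushing
# the computation into the database engine. Toy data, toy schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 250.0), ("west", 80.0)])

# 1) Extract-then-compute: every row crosses the wire (expensive
#    when the table holds billions of rows, as in a warehouse).
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# 2) In-database: the engine aggregates; only the tiny result moves.
in_db = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

assert totals == in_db  # same answer, far less data movement in (2)
```

In-database mining generalizes this from SUM/GROUP BY to whole statistical and scoring routines running next to the data, which is exactly what makes the DB2+SPSS combination interesting.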
Few other companies have this capability. One exception is Teradata together with SAS, and Netezza has implemented SAS functions inside its database. Also to its credit, Netezza last year acquired NuTech, a maker of tools for building predictive data mining applications.
All of this implies that most data warehouses would end up depending entirely on SAS, yet the warehouse vendors cannot talk SAS into being embedded directly in their databases - which is good news for Tibco.
Historically there has been one major data mining vendor (SAS), one middle player (SPSS), and a school of small fry such as Angoss, KXEN, and Insightful, and the signs are that the small fry cannot yet break the rules of the game set by SAS and SPSS.
But last year Tibco bought Spotfire and, right on its heels, Insightful - evidently out to prove it still has muscle; the company also owns a leading event-processing engine. I would very much like to see competitors to SAS grow up: only competition drives big advances.
2009/2/17
FYI: a data mining software shoot-out
At the end of last year (November 2008), mayato, a German technology consultancy, published an evaluation of data mining software covering the following 12 products:
Classic data mining suites (Classic Suites):
- SAS Enterprise Miner 5.3
- SPSS Clementine 12

Open-source data mining software (Open Source):
- RapidMiner 4.2
- KNIME 1.3.5
- Weka 3.4.13

Automated data mining software (Self-Acting):
- KXEN Analytic Framework 4.04

Specialized data mining software (Specialized):
- Viscovery SOMiner 5.0
- prudsys Discovery 5.5 / Basket Analyzer 5.2
- Bissantz Delta Master 5.3.6

Data mining built into BI products (BI Vendors):
- SAP NetWeaver 7.0 Data Mining Workbench
- Oracle 11g Data Mining
- Microsoft SQL Server 2005 Analysis Services
The data mining market has lately turned pluralistic: besides the products of the traditional vendors SAS and SPSS, there are special-purpose mining tools, plus the open-source packages and the mining features BI vendors bundle in. The market looks lively and gives users at every level a flexible range of choices. mayato's report is titled "Data Mining Software 2009: Successful Analyses at Affordable Prices" (November 2008).
Unfortunately, this mayato evaluation is not very deep, and its criteria are rather coarse. After what it calls a thorough assessment of four products - Enterprise Miner (SAS), RapidMiner (Rapid-I), Analytic Framework (KXEN), and NetWeaver Data Mining Workbench (SAP) - its result ranks KXEN's Analytic Framework first overall, with SAS's Enterprise Miner close behind, followed by SAP NetWeaver Data Mining Workbench and RapidMiner.
That KXEN came first does not surprise us (mayato is a KXEN partner). Still, KXEN's data-processing speed genuinely deserves praise. KXEN bills itself as self-acting data mining software, leaving customers little room for custom tuning, so under default settings the other packages give up quite a bit of running speed. One last note: in this evaluation KXEN led on speed, while SAS stood out most on capability.
Overall, the evaluation is too shallow to serve as more than a reference. Still, the data mining software market it reveals (and what it leaves out) is rather exciting; the omissions are of course numerous, including the important Teradata Warehouse Miner, IBM's DB2 Intelligent Miner, Angoss, Unica, and others.
Q: Can someone explain the differences among SPSS, MATLAB, SAS, and Excel for statistical applications?
Excel can do statistical analysis too; how do its analysis features differ from MATLAB's and SAS's?
Replies: 1
In my view, these are all data-processing applications. Excel has the friendliest interface, but its functionality is just too narrow: it suits routine, simple data handling, not more complex model analysis, so it sees little use in research. MATLAB has a graphical interface and fairly powerful functionality, and is currently the most widely used in research. SPSS and SAS are both strongly professional: the former is used mainly in social science research, the latter mainly in the natural sciences and economics. SPSS also has a graphical interface and is friendlier than SAS, which is operated entirely through a programming language; but SPSS's main weakness is data output, since its output cannot be opened directly in word processors such as Word. Below is some fairly detailed material I found, for reference.

**************************************

MATLAB takes its name from Matrix Laboratory. It is scientific computing software that handles data specifically in matrix form. MATLAB integrates high-performance numerical computation with visualization and provides a wealth of built-in functions, so it is widely used for analysis, simulation, and design in scientific computing, control systems, information processing, and other fields. Thanks to the open architecture of the MATLAB product line, its functionality is also very easy to extend, so that as understanding of a problem deepens, MATLAB itself can be continually refined to stay competitive.
The MATLAB product family can currently be used for:
- numerical analysis
- numerical and symbolic computation
- engineering and scientific graphics
- control system design and simulation
- digital image processing
- digital signal processing
- communication system design and simulation
- financial engineering

MATLAB itself is the foundation of the family. It provides the basic mathematical algorithms, such as matrix operations and numerical analysis; integrates 2D and 3D graphics for visualizing numerical results; and offers an interactive high-level programming language, the M language, in which users can implement their own algorithms by writing scripts or function files.
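As a rough analogy only (this is Python/NumPy, not real M code), the matrix-first style the answer describes looks like this:

```python
# NumPy analogy to MATLAB's matrix-oriented style: the data structure
# is the matrix, and linear algebra is a one-liner.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)   # roughly MATLAB's  x = A \ b
print(x)                    # [0.8 1.4]
print(A @ x)                # matrix-vector product, like A*x in MATLAB
```

The point of both environments is the same: algorithms are expressed directly in matrix operations backed by compiled numeric libraries, rather than in explicit element-by-element loops.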
MATLAB Compiler is a build tool that compiles function files written in MATLAB's M language into libraries, executables, and COM components. This extends MATLAB's reach and lets it interoperate with other high-level languages such as C/C++, each side playing to its strengths, raising runtime efficiency and enriching the means of program development. The M language has also been used to build MATLAB's professional toolbox functions for direct use. The toolbox algorithms are open and extensible: users can inspect them, modify them, and even add their own algorithms to extend a toolbox. There are now more than forty MATLAB toolboxes, covering data acquisition, scientific computing, control system design and analysis, digital signal processing, digital image processing, financial analysis, bioinformatics, and other specialties.

Simulink is a block-diagram design environment built on MATLAB for modeling, analyzing, and simulating all kinds of dynamic systems. Its modeling range is broad: any system that can be described mathematically can be modeled, including aerospace dynamics, satellite guidance and control, communications, ships, and automobiles, spanning continuous, discrete, conditionally executed, event-driven, single-rate, multi-rate, and hybrid systems. Simulink provides a drag-and-drop graphical interface for building block-diagram models, together with rich function blocks and domain-specific blocksets, so that an entire dynamic system can be modeled with scarcely a line of code.

Stateflow is an interactive design tool based on finite-state-machine theory for modeling and simulating complex event-driven systems. It is tightly integrated with Simulink and MATLAB, so complex control logic built in Stateflow can be folded effectively into a Simulink model.
Within the MATLAB family, the main automatic code generation tools are Real-Time Workshop (RTW) and Stateflow Coder, which translate Simulink block diagrams and Stateflow state charts directly into efficient, optimized program code. Code generated by RTW is concise, reliable, and readable; RTW currently generates standard C, with the capability of targeting other languages. Code generation, compilation, and download to the target are all automatic; the user only has to click a few buttons. For various real-time and non-real-time operating system platforms, MathWorks has developed corresponding target options which, paired with the right hardware and software, support rapid control prototyping, hardware-in-the-loop real-time simulation, and production code generation. MATLAB's open, extensible architecture also lets users define their own system targets, and Real-Time Workshop Embedded Coder turns Simulink models directly into efficiency-optimized production code, in floating point or fixed point.
MATLAB's open product architecture has made it the development tool of choice in many fields, and MATLAB has more than 300 third-party partners across scientific computing, mechanical dynamics, chemical engineering, computing and communications, automotive, finance, and other areas, with interfaces that include joint modeling, data sharing, and development workflow integration. Together with third-party hardware and software, MATLAB forms complete domain solutions covering the whole path from algorithm development through real-time simulation to code generation and final product. Typical applications include:
- control system development: a unified platform for rapid control prototyping and hardware-in-the-loop simulation with dSPACE
- signal processing system design: whole-system simulation and rapid prototype verification on TI DSP, Lyrtech, and other signal-processing hardware/software platforms
- communication system design: combined with products such as RadioLab 3G and Cadence
- mechatronic design: whole-system co-simulation, combined with Easy5, Adams, and others

***************************************
The modules I use most are Base, STAT, INSIGHT, EM, and ETS (my 8.2/9 license is incomplete, so some I have not used). Taking them one by one:

Base: powerful - the root of SAS. For a true expert, Base plus IML covers the vast majority of needs. Its strength in data management and data preparation is what I love most: when I once had to process and analyze a million call records, SAS was the only tool at hand that was up to it; Excel's limit of sixty-odd thousand rows and SPSS's unbearably slow speed were hopeless. The macro facility is also excellent: importing and splitting thousands of data files takes a single click of Run. The logical library (libname) mechanism is extremely convenient too. And Base's PROC SQL runs SQL faster than MS SQL Server, which says something about the quality of SAS's underlying engineering!
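For readers without SAS, the batch-import trick just described has a rough open-source analogue. This is my own hedged Python/pandas sketch, not the SAS macro itself; the directory layout and the "region" column are invented for illustration:

```python
# Import thousands of data files in one pass, then split back out by a
# key column - roughly what the SAS macro + Run click accomplished.
import glob
import os
import pandas as pd

frames = []
for path in sorted(glob.glob("calls/*.csv")):   # hypothetical file layout
    df = pd.read_csv(path)
    df["source_file"] = path                    # remember provenance
    frames.append(df)

calls = pd.concat(frames, ignore_index=True)

# Split the combined table into one file per region (assumed column).
os.makedirs("by_region", exist_ok=True)
for region, group in calls.groupby("region"):
    group.to_csv(f"by_region/{region}.csv", index=False)
```

The one-million-row limit that stops Excel is a non-issue here, though truly huge inputs would call for chunked reading rather than one concatenated frame.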
STAT: the statistics module - good enough is good. All the common statistical functions are here, roughly on par with SPSS (far faster, slightly weaker on everyday conveniences, richer in specialized features). Analyst is its visual interface; convenient, but it omits too much - it cannot even run factor analysis, which is a pity. Then again, most users probably cannot tell factor analysis from principal component analysis, so perhaps it is just as well not to let them try!

INSIGHT: convenient and flexible. As the name suggests, it is interactive data analysis, ideal for inspecting data and for exploratory analysis; the drawback is that results apparently cannot be saved.

EM: powerful, handsome, expensive! The data mining module; a year's rental is said to run to a million dollars - my goodness. But it really is the summit of SAS, currently the finest data mining software, and INSIGHT is integrated into it as a submodule.

ETS: time series at last. Stronger than EViews, though regrettably most of the functionality requires programming; the menus cover only a small part - too small a part.

IML: a fine thing, for matrix computation; it can stand in for MATLAB. I am just about to learn it...

All in all: SAS is for people who know what they are doing. The interface is unfriendly and most work must be programmed, and that is precisely SAS Institute's intent! Statistics is not a toy: many people run models without even understanding the models' assumptions. That kind of blind analysis is easy in Excel or SPSS, and hard in SAS.

*******************************************

SPSS is an acronym that originally stood for Statistical Package for the Social Sciences. As the breadth and depth of SPSS's products and services grew, the company in 2000 formally changed the full English name to Statistical Product and Service Solutions, "statistical products and service solutions," marking a major adjustment in SPSS's strategic direction. SPSS was the world's earliest statistical analysis software, developed in the late 1960s by three graduate students at Stanford University in the U.S., who founded SPSS Inc. at the same time and established its headquarters in Chicago in 1975.
SPSS公司已于2000年正式将英文全称更改为Statistical Product and Service Solutions,意为“统计产品与服务解决方案”,标志着SPSS的战略方向正在做出重大调整。 SPSS是世界上最早的统计分析软件,由美国斯坦福大学的三位研究生于20世纪60年代末研制,同时成立了SPSS公司,并于1975年在芝加哥组建了SPSS总部。1984年SPSS总部首先推出了
世界上第一个统计分析软件微机版本SPSS/PC+,开创了SPSS微机系列产品的开发方向,极大地扩充了它的应用范围,并使其能很快地应用于自然科学、技术科学、社会科学的各个领域,
世界上许多有影响的报刊杂志纷纷就SPSS的自动统计绘图、数据的深入分析、使用方便、功能齐全等方面给予了高度的评价与称赞。迄今SPSS软件已有30余年的成长历史。全球约有25万
家产品用户,它们分布于通讯、医疗、银行、证券、保险、制造、商业、市场研究、科研教育等多个领域和行业,是世界上应用最广泛的专业统计软
件。
SPSS是世界上最早采用图形菜单驱动界面的统计软件,它最突出的特点就是操作界面极为友好,输出结果美观漂亮。它将几乎所有的功能都以统一、规范的界面展现出来,
使用Windows的窗口方式展示各种管理和分析数据方法的功能,对话框展示出各种功能选择项。用户只要掌握一定的Windows操作技能,粗通统计分析原理,就可以使用该软件为特定的科
研工作服务。是非专业统计人员的首选统计软件。在众多用户对国际常用统计软件SAS、BMDP、GLIM、GENSTAT、EPILOG、MiniTab的总体印象分的统计中,其诸项功能均获得最高分。
SPSS采用类似EXCEL表格的方式输入与管理数据,数据接口较为通用,能方便的从其他数据库中读入数据。其统计过程包括了常用的、较为成熟的统计过程,完全可以满足非统计专业人士
的工作需要。
输出结果十分美观,存储时则是专用的SPO格式,可以转存为HTML格式和文本格式。对于熟悉老版本编程运行方式的用户,SPSS还特别设计了语法生成窗口,用户只需在菜单中选好各个选
项,然后按“粘贴”按钮就可以自动生成标准的SPSS程序。极大的方便了中、高级用户。 SPSS输出结果虽然漂亮,但不能为WORD等常用文字处理软件直接打开,只能采用拷贝、粘贴的方式加以交互。这可以说是SPSS软件的缺陷。
AI and Social Science - Brendan O’Connor
Cognition, systems, decisions, visualization, machine learning, etc.
About
This is a blog on artificial intelligence and social science — call it "Social Science++" — with an emphasis on computation and statistics. My general website is anyall.org.
Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata
Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis. I think most people choose one based on what people around them use or what they learn in school, so I’ve found it hard to find comparative information. I’m posting the table here in hopes of useful comments.

Name | Advantages | Disadvantages | Open source? | Typical users
R | Library support; visualization | Steep learning curve | Yes | Finance; Statistics
Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering
SciPy/NumPy/Matplotlib | Python (general-purpose programming language) | Immature | Yes | Engineering
Excel | Easy; visual; flexible | Large datasets | No | Business
SAS | Large datasets | Expensive; outdated programming language | No | Business; Government
Stata | Easy statistical analysis | | No | Science
SPSS | Like Stata but more expensive and worse | | |
There’s a bunch more to be said for every cell. Among other things:
- Two big divisions on the table: The more programming-oriented solutions are R, Matlab, and Python. More analytic solutions are Excel, SAS, Stata, and SPSS.
- Python “immature”: matplotlib, numpy, and scipy are all separate libraries that don’t always get along. Why does matplotlib come with “pylab” which is supposed to be a unified namespace for everything? Isn’t scipy supposed to do that? Why is there duplication between numpy and scipy (e.g. numpy.linalg vs. scipy.linalg)? And then there’s package compatibility version hell. You can use SAGE or Enthought but neither is standard (yet). In terms of functionality and approach, SciPy is closest to Matlab, but it feels much less mature.
- Matlab’s language is certainly weak. It sometimes doesn’t seem to be much more than a scripting language wrapping the matrix libraries. Python is clearly better on most counts. R’s is surprisingly good (Scheme-derived, smart use of named args, etc.) if you can get past the bizarre language constructs and weird functions in the standard library. Everyone says SAS is very bad.
- Matlab is the best for developing new mathematical algorithms. Very popular in machine learning.
- I’ve never used the Matlab Statistical Toolbox. I’m wondering, how good is it compared to R?
- Here’s an interesting reddit thread on SAS/Stata vs R.
- SPSS and Stata in the same category: they seem to have a similar role so we threw them together. Stata is a lot cheaper than SPSS, people usually seem to like it, and it seems popular for introductory courses. I personally haven’t used either…
- SPSS and Stata for “Science”: we’ve seen biologists and social scientists use lots of Stata and SPSS. My impression is they get used by people who want the easiest way possible to do the sort of standard statistical analyses that are very orthodox in many academic disciplines. (ANOVA, multiple regressions, t- and chi-squared significance tests, etc.) Certain types of scientists, like physicists, computer scientists, and statisticians, often do weirder stuff that doesn’t fit into these traditional methods.
- Another important thing about SAS, from my perspective at least, is that it’s used mostly by an older crowd. I know dozens of people under 30 doing statistical stuff and only one knows SAS. At that R meetup last week, Jim Porzak asked the audience if there were any recent grad students who had learned R in school. Many hands went up. Then he asked if SAS was even offered as an option. All hands went down. There were boatloads of SAS representatives at that conference and they sure didn’t seem to be on the leading edge.
- But: is there ANY package besides SAS that can do analysis for datasets that don’t fit into memory? That is, ones that mostly have to stay on disk? And exactly how good are SAS’s capabilities here anyway? (See the sketch after this list.)
- If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work. There are a few multi-machine data processing frameworks that are somewhat standard (e.g. Hadoop, MPI) but it’s an open question what the standard distributed data analysis framework will be. (Hive? Pig? Or quite possibly something else.)
- (This was an interesting point at the R meetup. Porzak was talking about how going to MySQL gets around R’s in-memory limitations. But Itamar Rosenn and Bo Cowgill (Facebook and Google respectively) were talking about multi-machine datasets that require cluster computation that R doesn’t come close to touching, at least right now. It’s just a whole different ballgame with that large a dataset.)
- SAS people complain about poor graphing capabilities.
- R vs. Matlab visualization support is controversial. One view I’ve heard is, R’s visualizations are great for exploratory analysis, but you want something else for very high-quality graphs. Matlab’s interactive plots are super nice though. Matplotlib follows the Matlab model, which is fine, but is uglier than either IMO.
- Excel has a far, far larger user base than any of these other options. That’s important to know. I think it’s underrated by computer scientist sort of people. But it does massively break down at >10k or certainly >100k rows.
- Another option: Fortran and C/C++. They are super fast and memory efficient, but tricky and error-prone to code, have to spend lots of time mucking around with I/O, and have zero visualization and data management support. Most of the packages listed above run Fortran numeric libraries for the heavy lifting.
- Another option: Mathematica. I get the impression it’s more for theoretical math, not data analysis. Can anyone prove me wrong?
- Another option: the pre-baked data mining packages. The open-source ones I know of are Weka and Orange. I hear there are zillions of commercial ones too. Jerome Friedman, a big statistical learning guy, has an interesting complaint that they should focus more on traditional things like significance tests and experimental design. (Here; the article that inspired this rant.)
- I think knowing where the typical users come from is very informative for what you can expect to see in the software’s capabilities and user community. I’d love more information on this for all these options.
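As a rough illustration of the "analysis on data that mostly stays on disk" question raised in the SAS bullet above, here is a sketch (mine, not from the original post) that streams a large CSV in chunks with pandas and keeps only constant-size running statistics in memory; the file name and column are invented:

```python
# Streaming summary statistics over a file too big for memory.
# Only a running sum/count/max is kept, so memory use stays constant
# no matter how large the file is.
import pandas as pd

total, count, maximum = 0.0, 0, float("-inf")
for chunk in pd.read_csv("huge_dataset.csv", chunksize=1_000_000):
    col = chunk["amount"]            # assumed numeric column
    total += col.sum()
    count += col.count()             # non-null rows in this chunk
    maximum = max(maximum, col.max())

print("mean:", total / count, "max:", maximum)
```

This covers one-pass summaries; anything requiring random access or multi-pass algorithms over the full dataset is where SAS's on-disk data sets (or a database) still have the edge.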
114 comments to “Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata”
- Eric Sun wrote (23 February 2009, 8:53 pm):
>> I know dozens of people under 30 doing statistical stuff and only one knows SAS.
I’m assuming the “one” is me, so I’ll just say a few points:
I’m taking John Chambers’s R class at Stanford this quarter, so I’m slowly and steadily becoming an R convert.
That said, I don’t think anything besides SAS can do well with datasets that don’t fit in memory. We used SAS in litigation consulting because we frequently had datasets in the 1-20 GB range (i.e. can fit easily on one hard disk but difficult to work with in R/Stata where you have to load it all in at once) and almost never larger than 20GB. In this relatively narrow context, it makes a lot of sense to use SAS: it’s very efficient and easy to get summary statistics, look at a few observations here and there, and do lots of different kinds of analyses. I recall a Cournot Equilibrium-finding simulation that we wrote using the SAS macro language, which would be quite difficult in R, I think. I don’t have quantitative stats on SAS’s capabilities, but I would certainly not think twice about importing a 20 GB file into SAS and working with it in the same way as I would a 20 MB file.
That said, if you have really huge internet-scale data that won’t fit on one hard drive, then SAS won’t be too useful either. I’ll be very interested if this R + Hadoop system ever becomes mature: http://www.stat.purdue.edu/~sguha/rhipe/
In my work at Facebook, Python + RPy2 is a good solution for large datasets that don’t need to be loaded into memory all at once (for example, analyzing one facebook network at a time). If you have multiple machines, these computations can be speeded up using iPython’s parallel computing facilities.
Also, R’s graphical capabilities continue to surprise me; you can actually do a lot of advanced stuff. I don’t do much graphics, but perhaps check out “R Graphics” by Murrell or Deepayan Sarkar’s book on Lattice Graphics.
- Eric Sun wrote (23 February 2009, 8:55 pm): I thought that most people consider SAS to have the highest learning curve, certainly higher than R. But maybe I’m mistaken about that.
- Justin wrote (23 February 2009, 10:24 pm): Calling scipy immature sounds somehow “wrong”. The issues you come up with are more like early design flaws that will not go away, no matter how “mature” scipy is getting.
That said, these are flaws, but they seem pretty minor to me.
- Edward wrote: I’ve recently seen GNU DAP mentioned as an open-source equivalent to SAS. Know if it’s any good?
- TS Waterman wrote (23 February 2009, 10:49 pm): Have you considered Octave in this regard? It’s a GNU-licensed Matlab clone. Very nice graphing capability, Matlab syntax and library functions, open source.
http://www.gnu.org/software/octave/FAQ.html#MATLAB-compatibility
- @Eric - oops, yeah, should’ve put SAS as hardest. Good point that the standard for judging how good large dataset support is, is whether you can manipulate a big dataset the same way you manipulate a small dataset. I’ve loaded 1-2 GB of data into R and you definitely have to do things differently (e.g. never use by()).
@Justin - scipy certainly seems like it keeps improving. I just keep comparing it to matlab and it’s constantly behind. I remember once watching someone try to make a 3d plot. He spent quite a while going through various half-baked python solutions that didn’t work. Then he booted up matlab and had one in less than a minute. Matlab’s functionality is well-designed, well-put-together and well-documented.
@Edward - I have seen it mentioned too. From glancing at its home page, it seems like a pretty small-time project.
- @TS - yeah, i used octave just once for something simple. it worked fine. my issues were: first, i’m not impressed with gnuplot graphing. second, the interactive environment isn’t too great. third, trying to clone the matlab language seems crazy since it’s kind of crappy. i think i’d usually pick scipy over octave if being free is a requirement, else go with matlab if i have access to it.
otoh it looks like it supports some nice things like sparse matrices that i’ve had a hard time with lately in R and scipy. i guess worth another look at some point…
-
Brendan,
Nice overview, I think another dimension you don’t mention — but which Bo Cowgill alluded to at our R panel talk — is performance. Matlab is typically stronger in this vein, but R has made significant progress with more recent versions. Some benchmark results can be found at:
http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php
MD
-
Mike wrote:In high energy particle physics, ROOT is the package of choice. It’s distributed by CERN, but it’s open source, and is multi-platform (though the Linux flavor is best supported). It does solve some of the problems you mentioned, like running over large datasets that can’t be entirely memory-resident. The syntax is C++ based, and has both an interpreter and the ability to compile/execute scripts from the command line.
23. February 2009 at 11:27 pm :
There are lots of reasons to prefer other packages (like R) over ROOT for certain tasks, but in the end there’s little that can be done with other packages that one cannot do with ROOT.
-
This is obviously oversimplified - but that is the point of a succinct
comparison. I would add that you are missing a lot of disadvantages for Excel -
it has incomplete statistics support and an outdated “language” :)
Python actually really shines above the others for handling large datasets using memmap files or a distributed computing approach. R obviously has a stronger statistics user base and more complete libraries in that area - along with better “out-of-the-box” visualizations. Also, some of the benefits overlap - using numpy/scipy you get that same elegant matrix support / syntax that matlab has, basically slicing arrays and wrapping lapack.
The advantages of having a real programming language and all the additional non-statistical libraries & frameworks available to you make Python the language of choice for me. If there is something scipy is weak at that I need, I’ll also use R in a pinch or move down to C. I think you are basically operating at a disadvantage if you are using the other packages at this point. The only other reason I can see to use them is if you have no choice, for example if you inherited a ton of legacy code within your organization.
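To illustrate the memmap approach just mentioned, a small numpy sketch (the file name and dtype are assumptions):

    import numpy as np

    # 'big.dat' is assumed to hold raw float64 values, e.g. written with:
    #   np.arange(1e6).tofile('big.dat')
    data = np.memmap('big.dat', dtype='float64', mode='r')
    # Slices are paged in from disk on demand, so the whole file never
    # has to fit in memory at once.
    print(data.size, data[:10].mean())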
-
I’m sure you’ve stirred up a lot of controversy. Thanks for calling ‘em like
you see ‘em.
As for Mathematica, I haven’t used it for statistics beyond some basic support for common distributions. But one thing it does well is very consistent syntax. I used it when it first came out, then didn’t use it for years, and then started using it again. When I came back to it, I was able to pick it up right where I left off. I can’t put R down for a week and remember the syntax. Mathematica may not do everything, but what it does do, it does elegantly.
-
it would be awesome to have an informal, hands on tutorial comparison of
several of these languages (looking at ease, performance, features, etc.). maybe
a meetup at something like super
happy dev house, or even something separate. just a thought!
-
@Michael Driscoll - good point! I was afraid to make performance claims since
I’ve heard that Matlab is getting faster, they have a JIT or a nice compiler or
something now, and I haven’t used it too much recently. (That benchmark page
doesn’t even say which matlab version was used, though I emailed the guy…) I’m
also suspicious of performance comparisons since I’d expect much of it to be
very dependent on the matrix library and there are several LAPACKs out there
(ATLAS and others) and many compile-time parameters to fiddle with. I think I
read something claiming many binary builds of R don’t use the best LAPACK they
could. I’m not totally sure of this though. But if it’s true that Matlab knows
how to vectorize for-loops, that’s really impressive.
@Mike - ah yes, i remember looking at ROOT a long time ago and thinking it was impressive. But then I forgot about it because all the cs/stats people whose stuff I usually read don’t know about it. I think it just goes to show that the data analysis tools problem is tackled so differently by different groups of people, it’s very easy to miss out on better options just due to lack of information!
@Pete - yeah I whine about python. but I seem to use numpy plenty still :) actually its freeness is a huge win over matlab for cluster environments since you don’t have to pay for a zillion licenses…
Hm I seem to be talking myself into thinking it’s down to R vs Python vs Matlab. then the rosetta stone http://mathesaurus.sourceforge.net/matlab-python-xref.pdf should be my guide…
@John - very interesting. I think many R users have had the experience of quickly forgetting how to do basic things.
-
From David Knowles, who did the comparison Mike Driscoll linked to
(http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php):
> Nice comparison. I would add to the pros of R/Python that the data
> structures are much richer than Matlab. The big pro of Matlab still
> seems to be performance (and maybe the GUI for some people). On top of
> being expensive Matlab is a nightmare if you want to run a program on
> lots of nodes because you need a license for every node!
>
> It’s 2008b I did the comparison with - I should mention that!
-
From Rob Slaza’s
statistics toolbox tutorials, it *seems* like using MATLAB for stats is
reasonably simple…
-
> On top of being expensive Matlab is a nightmare if you want to run a program on lots of nodes because you need a license for every node!
@Brendan:
Re David Knowles’ comment…
There are specialized parallel/distributed computing tools available from MathWorks for writing large-scale applications (for clusters, grid etc.). You should check out: http://www.mathworks.com/products/parallel-computing.
Running full-fledged desktop MATLAB on a huge number of nodes is messy and of course very expensive, not to mention that a single user would take away several licenses for which other users would have to wait.
Disclosure: I work for the parallel computing team at The MathWorks
-
Another guy from Mathworks, their head of Matlab product management Scott Hirsch, contacted me about the language issue and was very kind and clarifying. The most interesting bits below.
On Tue, Feb 24, 2009 at 7:20 AM, Scott Hirsch wrote:
>> Brendan –
>>
>> Thanks for the interesting discussion you got rolling on several popular
>> data analysis packages
[...]
>> I’m always very interested to hear the perspectives of MATLAB users, and
>> appreciate your comments about what you like and what you don’t like. I was
>> interested in following up on this comment:
>>
>> “Matlab’s language is certainly weak. It sometimes doesn’t seem to be
>> much more than a scripting language wrapping the matrix libraries. “
>>
>> I have my own assumptions about what you might mean, but I’d be very
>> interested in hearing your perspectives here. I would greatly appreciate it
>> if you could share your thoughts on this subject.
>
> sure. most of my experiences are with matlab 6. just briefly,
>
> * leave out semicolon => print the expression. that is insane.
> * each function has to be defined in its own file
> * no optional arguments
> * no named arguments
> * no way to group variables together in a structure. (i don’t need object
> orientation, just a bunch of named items)
> * no perl/python-style hashes
> * no object orientation (or just a message dispatch system) … less
> important
> * poor/no support for text
> * or other things a general purpose language knows how to do (sql, networks,
> etc etc)
On Tue, Feb 24, 2009 at 11:27 AM, Scott Hirsch wrote:
> Thanks, Brendan. This is very helpful. Some of the things have been
> addressed, but not all. Here are some quick notes on where we are today.
> Just to be clear – I have no intention (or interest) in changing your
> perspectives, just figured I could let you know in case you were curious.
>
>
>
> > * leave out semicolon => print the expression. that is insane.
> No plans to change this. Our solution is a bit indirect, but doesn’t break
> the behavior that lots of users have come to expect. We have a code
> analysis tool (M-Lint) that will point out missing semi-colons, either while
> you are editing a file, or in a batch process for all files in a directory.
>
> > * each function has to be defined in its own file
> You can include multiple functions in a file, but it introduces unique
> semantics – primarily that the scope of these functions is limited to within
> the file.
[[ addendum from me: yeah, exactly. if you want to make functions that are
shared in different pieces of your code, you usually have to do 1 function per
file. ]]
> > * no optional arguments
> Nothing yet.
>
> > * no named arguments
> Nope.
>
> > * no way to group variables together in a structure. (i don’t need object
> orientation, just a bunch of named items)
> We’ve had structures since MATLAB 5.
[[ addendum from me: well, structures aren't very conventional in standard
matlab style, or at least certainly not the standard library. most algorithm
functions return a tuple of variables, instead of packaging things together
into a structure. ]]
> > * no perl/python-style hashes
> We just added a Map container last year.
>
> > * no object orientation (or just a message dispatch system) … less
> important
> We had very weak OO capabilities in MATLAB 6, but introduced a modern system
> in R2008a.
>
> > * poor/no support for text
> This has gotten a bit better, primarily through the introduction of regular
> expressions, but can still be awkward.
>
> > * or other things a general purpose language knows how to do (sql, networks,
> etc etc)
> Not much here, other than a smattering (Database Toolbox for SQL,
> miscellaneous commands for web interaction, WSDL, …)
>
> Thanks again. I really do appreciate getting your perspective. It’s
> helpful for me to understand how MATLAB is perceived.
>
> -scott
-
@Gaurav - it sure would be nice if i could see how much this parallel toolbox
costs without having to register for a login!
-
There is another good numpy/matlab comparison here:
http://www.scipy.org/NumPy_for_Matlab_Users
As of the last year, a standard ipython install (“easy_install IPython[kernel]”) now includes parallel computing right out of the box, no licenses required:
http://ipython.scipy.org/doc/rel-0.9.1/html/parallel/index.html
If this is going to turn into a performance shootout, then I’ll add that from what I’ve seen Python with numpy/scipy outperforms Matlab for vectorized code.
My impression has been that performance order is Numpy > Matlab > R, but as my friend Mike Salib used to say - “All benchmarks are lies”. Anyway, competition is good and discussions like this keep everyone thinking about how to improve their platforms.
Also, keep in mind that performance is often a sticking point for people when it need not be. One of the things I’ve found with dynamically typed languages is that ease of use often trumps raw performance - and you can always move the intensive stuff down to a lower level.
For people who like poking at numbers:
http://www.scipy.org/PerformancePython
http://www.mail-archive.com/numpy-discussion@scipy.org/msg14685.html
http://www.mail-archive.com/numpy-discussion@scipy.org/msg01282.html
Sturla has some strong points here:
http://www.mail-archive.com/numpy-discussion@scipy.org/msg14697.html
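To make the vectorization point concrete, a quick sketch of the kind of comparison being discussed (plain numpy; absolute timings will of course vary with the machine and the BLAS build):

    import time
    import numpy as np

    x = np.random.rand(1000000)

    # Pure-Python loop: every element is touched from the interpreter.
    t0 = time.time()
    total = 0.0
    for v in x:
        total += v * v
    t_loop = time.time() - t0

    # Vectorized: a single call into compiled linear algebra code.
    t0 = time.time()
    total_vec = np.dot(x, x)
    t_vec = time.time() - t0

    print(t_loop, t_vec)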
-
thrope wrote:@brendano - I think it might be a case of “if you have to ask you can’t afford it” :)
25. February 2009 at 11:44 am :
-
devicerandom wrote:What about Origin (and Linux/Unix open source clones like Qtiplot)? I know a lot of people using them, and they allow fast, easy statistical analysis with beautiful graphs out of the box. Qtiplot is quite immature but it is Python-scriptable, which is a definite plus for me - I don’t know about Origin.
25. February 2009 at 11:48 am :
-
Hi. I think this is a very incomplete comparison. If you want to make a real comparison, it should be more complete than this wiki article. And to give a bit of personal feedback:
I know 2 people using STATA (social science), 2 people using Excel (philosophy and economics), several using LabView (engineers), some using R (statistical science, astronomy), several using S-Lang (astronomy), several using Python (astronomy) and by using Python, I mean that they are using the packages they need, which might be numpy, scipy, matplotlib, mayavi2, pymc, kapteyn, pyfits, pytables and many more. And this is the main advantage of using a real language for data analysis: you can choose among the many solutions the one that fits you best. I also know several people who use IDL and ROOT (astronomy and physics).
I have used IDL, ROOT, PDL, (Excel if you really want to count that in) and Python and I like Python best :-)
@brendano: One other note: I think that you really have to distinguish between data analysis and data visualization. In astronomy this is often handled by completely different software. The key here is to support standardized file storage/exchange formats. In your example the people used scipy, which does not offer a single visualization routine, so you cannot blame scipy for difficulties with 3D plots…
-
I am a core scipy/numpy developer, and I don’t think calling them immature from a user POV is totally unfair. Every time someone tries numpy/scipy/matplotlib and cannot plot something simple in a couple of minutes, it is a failure on our side. I can only say that we are improving - projects like pythonxy or enthought are really helpful too for people who want something more integrated.
There is no denying that if you are into an integrated solution, numpy/scipy is not the best solution of the ones mentioned today - it may well be the worst (I don’t know them all, but I am very familiar with matlab, and somewhat familiar with R). There is a fundamental problem for all those integrated solutions: once you hit their limitations, you can’t go beyond them. Not being able to handle data which does not fit in memory in matlab, that’s a pretty fundamental issue, for example. Not having basic data structures (hashmap, tree, etc…) is another one. Making advanced UIs in matlab is not easy either.
You can build your own solution with the python stack: the numpy array capabilities are far beyond matlab’s, for example (broadcasting and advanced indexing are much more powerful than matlab’s current capabilities). The C API is complete, and you can do things which are simply not possible with matlab. You want to handle very big datasets? pytables gives you a database-like API on top of hdf5. Things like cython are also very powerful for people who need speed. I believe those are partially consequences of not being integrated.
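For readers who haven’t seen the broadcasting and advanced indexing David mentions, a tiny numpy illustration:

    import numpy as np

    a = np.arange(12).reshape(3, 4)

    # Broadcasting: subtract a per-column mean with no explicit loop.
    centered = a - a.mean(axis=0)

    # Advanced ("fancy") indexing: pick arbitrary rows, or all elements
    # satisfying a condition, in one expression.
    some_rows = a[[0, 2]]
    big_values = a[a > 5]

    print(centered)
    print(some_rows)
    print(big_values)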
Concerning the flaws you mentioned (scipy.linalg vs numpy.linalg, etc…): those are mostly legacies, or exist because removing them would be too costly. There are some efforts to remove redundancy, but not all of them will disappear. They are confusing for a newcomer (they were for me), but they are pretty minor IMHO, compared to other problems.
-
bill wrote:You forgot support and continuity. In my experience, SAS offers very good support and continuity. Others claim SPSS does, too (I have no experience there). In a commercial environment, the programs need to outlive the analyst and the whims of the academic/grad student support/development. For one-off disposable projects, R has lots of advantages. For commercial systems, not so many.
25. February 2009 at 2:29 pm :
-
Lou Pecora wrote:I’ve looked at several of the “packages” mentioned here (R, Octave, MATLAB, C, C++, Fortran, Mathematica). I’m a physicist who is often working in new fields where understanding the phenomena is the main goal. This means my colleagues and I are often developing new numerical/theoretical/data-analysis approaches. For anyone in this situation I unequivocally recommend:
25. February 2009 at 4:45 pm :
Python.
Why? Because given my situation there often are no canned routines. That means soon or later (usually sooner) I will be programming. Of all the languages and packages I’ve used Python has no equal. It is object oriented, has very forgiving run-time behavior, fast turn around (no edit, compile, debug cycles — just edit and run cycles), great built in structures, good modularity, and very good libraries. And, it’s easy to learn. I want to spend my time getting results, not programming, but I have to go through code development since often nothing like what I want to do exists and I’ve got to link the numerics to I/O and maybe some interactive things that make it easy to use and run smoothly. I’ve taken on projects that I would not want to attempt in any of the packages/languages I’ve listed.
I agree that Python is not wart-free. The version compatibility can sometimes be frustrating. “One-stop shopping” for a complete Python package is not here, yet (although Enthought is making good progress). It will never be as fast as MATLAB for certain things (JIT compiling, etc. makes MATLAB faster at times). Python plotting is certainly not up to Mathematica standards (although it is good).
However, the Python community is very nice and very responsive. Python now has several easy ways to add extensions written in C or C++ for faster numerics. And for all my desire not to spend time coding, I must admit I find Python programming fun to do. I cannot say that for anything else I’ve used.
-
There is good reason for the duplication of “linalg” in SciPy. SciPy’s version has more features, which probably aren’t of as much use to as wide an audience,
and (perhaps more importantly) one of the requirements for NumPy is that it not
depend critically on a Fortran compiler. SciPy relaxes this requirement, and
thus can leverage a lot of existing Fortran code. At least that’s my
understanding.
-
These packages change and it’s easy to get locked-in ideas from the past. I
haven’t used Matlab since the 1990s, but the last time I used it, its I/O and singular value decomposition were so slow that we switched to S-Plus just to finish in our lifetimes.
Can any of these packages compute sparse SVDs like folks have used for Netflix (500K x 25K matrix with 100M partial entries)? Or do regressions with millions of items and hundreds of thousands of coefficients? I typically wind up writing my own code to do this kind of thing in LingPipe, as do lots of other folks (e.g. Langford et al.’s Vowpal Wabbit, Bottou et al.’s SGD, Madigan et al.’s BMR).
What’s killing me now is scaling Gibbs samplers. BUGS is even worse than R in terms of scaling, but I can write my own custom samplers that fly in some cases and easily scale. I think we’ll see more packages like Daume’s HBC for this kind of thing.
R itself tends to just wrap the real computing in layers of scripts to massage data and do error checking. The real code is often Fortran, but more typically C. That must be the same for SciPy given how relatively inefficient Python is at numerical computing. It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.
A problem I see with the way R and BUGS work is that they typically try to compile a declarative model (e.g. a regression equation in R’s glm package or a model specification in BUGS), rather than giving you control over the basic functionality (optimization or sampling).
The other thing to consider with these things from a commercial perspective is licensing. R may be open source, but its GNU license means we can’t really deploy any commercial software on top of it. SciPy has a mixed bag of licenses that is also not redistribution-friendly. I don’t know what licensing/redistribution looks like for the other packages.
@bill Support and continuity (by which I assume you mean stability of interfaces and functionality) are great in the core R and BUGS. The problem’s in all the user-contributed packages. Even there, the big ones like lmer are quite stable.
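On the sparse SVD question above: later SciPy releases do gain a truncated sparse SVD in scipy.sparse.linalg (svds, ARPACK-based). A toy sketch with a random stand-in for a ratings matrix, nowhere near Netflix scale:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    # Random stand-in for a sparse ratings matrix: 1000 x 200, 5000 entries.
    rows = np.random.randint(0, 1000, size=5000)
    cols = np.random.randint(0, 200, size=5000)
    vals = np.random.rand(5000)
    m = sp.csr_matrix((vals, (rows, cols)), shape=(1000, 200))

    # Compute only the k leading singular triplets.
    u, s, vt = svds(m, k=10)
    print(s)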
-
As for the rather large speed gains made by recent MATLAB releases that Lou
noted, I believe this is due in most part to their switch to the Intel
Math Kernel Library in place of a well-tuned ATLAS (I’m not completely sure
if that’s what they used before, but it’s a good bet). This hung a good number
of people with PowerPC G5’s out to dry rather quickly as newer MATLABs
apparently only run on Intel Macs (probably so they don’t have to maintain two
separate BLAS backends).
Accelerated linear algebra routines written by people who know the processors inside and out will result in big wins, obviously. You can also license the Intel MKL separately and use it to compile NumPy (if I recall correctly, David Cournapeau who commented above was largely responsible for this capability, so bravo!). I figure it’s only a matter of time before somebody like Enthought latches onto the idea of selling a Python environment with the Intel MKL baked in, so you can get the speedups without the hassle.
-
@ben The SciPy team was also unhappy about the licensing issue, so you’ll be
glad to hear that SciPy 0.7 was released under a single, BSD license.
You said “It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.” We are currently working on ways to expose the mathematical functions underlying NumPy to C, so that you can access it in your extension code. During the last Google Summer of Code, the Cython team implemented a friendly interface between Cython and NumPy. This means that you can code your algorithms in Python, but still have the speed benefits of C.
A number of posts above refer to plotting in 3D. I can recommend Enthought’s Mayavi2, which makes interactive data visualisation a pleasure:
http://code.enthought.com/projects/mayavi/
We are always glad for suggestions on how to improve SciPy, so if you do try it out, please join the mailing list and tell us more about your experience.
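A short example in the spirit of that Mayavi2 recommendation (at the time the package lived in the enthought.* namespace; newer installs use plain mayavi):

    import numpy as np
    from enthought.mayavi import mlab   # on newer installs: from mayavi import mlab

    # Interactive 3D surface plot of sin(x*y).
    x, y = np.mgrid[-3:3:100j, -3:3:100j]
    mlab.surf(x, y, np.sin(x * y))
    mlab.show()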
-
You should probably add GenStat to your list; this is a UK package specialising in the biosciences. It’s a relative heavyweight in stats, having come from Rothamsted Research (home of Fisher, Yates and Nelder). Nelder was the actual originator of GenStat. GenStat is also free for teaching world-wide and free for research to the developing world. Its popularity is mainly within Europe, Africa and Oceania, which is why many US researchers may not have heard of it. I hope this helps
-
Wow, this is the funnest language flamewar I’ve seen.
I will note that no one defended SAS. Maybe those people don’t read blogs.
-
bill wrote:brendano,
27. February 2009 at 3:26 am :
Hmm, I thought I did. I do production work in SAS and mess around (test new stuff, experimental analyses) in R.
Bill
-
Oops. Yes yes. My bad!
OK: no one has defended Stata!
-
John Dudley wrote:My company has been using StatSoft’s Statistica for years and it does all of the things that you found to be shortcomings of SAS, SPSS and Matlab…
4. March 2009 at 2:46 pm :
It’s fast, graphs are great and there are virtually no limitations. I’m surprised it wasn’t listed as one of the packages reviewed. We have been using it for years and it is absolutely critical to our business model.
-
StatSoft is the only major package with R integration…The best of both
worlds.
-
Abhijit wrote:In stats there seem to be the S-Plus/R schools and the SAS schools. SAS people find R obtuse with poor documentation, and the R people say the same about SAS (myself included). R wins in graphics, flexibility and customizability (though I certainly won’t argue with a SAS pro who can whip up macros). SAS seems a bit better with large data sets. R is ever expanding, and has improved greatly for simulations/looping and memory management. Recently for large datasets (bioinformatic, not the 5-10 GB financial ones), I’ve used a combination of Python and R to great effect, and am very pleased with the workflow. I think rpy2 is a great addition to Python and works quite well. For some graphs I actually prefer matplotlib to R.
5. March 2009 at 3:38 am :
I’m also a big fan of Stata for more introductory level stuff as well as for epidemiology-related stuff. It is developing a programming language that seems useful. One real disadvantage in my book is its ability to hold only one dataset at a time, as well as a limit on the data size.
I’ve also used Matlab for a few years. Its statistics toolbox is quite good, and Matlab is pretty fast and has great graphics. It’s limited in terms of regression modeling to some degree, as well as in survival methods. Syntactically I find R more intuitive for modeling (though that is the lineage I grew up with). The other major disadvantage of Matlab is distribution of programs, since Matlab is expensive. The same complaint goes for SAS, as well :)
-
I’ll sing the same song here as I do elsewhere on this topic.
In large-scale production, SAS is second to none. Of course, large-scale production shops usually have the $$$ to fork over, and SAS’s workflow capabilities (and, to a lesser extent, large dataset handling capabilities) save enough billable hours to justify the cost. However, for graphics, exploratory data analysis, and analysis beyond the well-established routines, you have to venture into the world of SAS/IML, which is a rather painful place to be. Its PRNGs are also stuck in the last century: top of the line of a class of generators now obsolete for anything other than teaching.
R is great for simulation, exploratory data analysis, and graphics. (I disagree with the assertion that R can’t do high-quality graphics, and, like some commenters above, recommend Paul Murrell’s book on the topic.) Its language, while arcane, is powerful enough to write outside-the-box analyses. For example, I was able to quickly write, debug, and validate an unconventional ROC analysis based on a paper I read. As another example, bootstrapping analyses are much easier in R than SAS.
In short, I keep both SAS and R around, and use both frequently.
I can’t comment too much on Python. MATLAB (or Octave or Scilab) is great for roll-your-own statistical analyses as well, though I can’t see using it for, e.g., a conventional linear models analysis unless I wanted the experience. R’s matrix capabilities are enough for me at this point. I used Mathematica some time ago for some chaos theory and Fourier/wavelet analysis of images and it performed perfectly well. If I could afford to shell out the money for a non-educational license, I would just to have it around for the tasks it does really well, like symbolic manipulation.
I used SPSS a long time ago, and have no interest in trying it again.
-
SPSS has for several years been offering smooth integration with both Python and R. There are extensive APIs for both. Check out the possibilities at http://www.spss.com/devcentral. See also my blog at insideout.spss.com.
You can even easily build SPSS Statistics dialog boxes and syntax for R and Python programs. DevCentral has a collection of tools to facilitate this.
This integration is free with SPSS Base.
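For the curious, driving SPSS syntax from Python via the integration plugin looks roughly like this (the spss module ships with the plugin; the file path and variable names here are invented):

    import spss

    # Submit ordinary SPSS syntax from a Python program.
    spss.Submit("""
    GET FILE='C:/data/survey.sav'.
    FREQUENCIES VARIABLES=age income.
    """)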
-
Sean wrote:I used Matlab, R, stata, spss and SAS over the years.
11. March 2009 at 4:18 am :
To me, the only reason for using SAS is its large-data ability; otherwise, it is a very, very bad program that, from day one, trains its users to be third-rate programmers.
The learning curve for SAS is actually very steep, particularly for a very logical person. Why? The whole syntax in SAS is pretty illogical and inconsistent: sometimes it is ‘/out’, sometimes it is ‘output’.
Only in 9.2 did SAS start to make variables inside a macro local by default. This is ridiculous!! The SAS company has existed for at least 30 years. How can such a basic programming rule take 30 years to be implemented?!
Also, if a variable is uninitialized, SAS will still let the code run. At one company I worked for, this simple, stupid SAS design flaw cost our project three weeks of delay (there was one uninitialized variable among 80k lines of log, all blue). A couple of PhDs on the project who had used C and Matlab could not believe SAS would make such a stupid mistake. But, to their great disbelief, it did!
My ranking is that Matlab and R are about the same; Matlab is better at plots most times, R is better at manipulating datasets. Stata and SAS are on the same level.
After taking cost into account, the answer is even more obvious.
-
bill r wrote:SAS was not designed by a language maven, like Pascal was. It grew from its PL/1 and Fortran roots. It is a collection of working tools, added to meet the demands of working statisticians and IT folk, that has grown since its start in the late ’60s and early ’70s. SAS clearly has cruft that shows its growth over time. Sort of like the UNIX tools, S, and R, actually.
12. March 2009 at 1:37 pm :
And, really, what competent programmer would ever use a variable without initializing or testing it first? That’s a basic programming rule I learned back in the mid ’60s, after branching off of uninitialized registers, and popping empty stacks.
Bah, you kids. Get off of my lawn!
-
tom p wrote:i work for a retail company that deploys SAS for their large datasets and complex analysis. just about everything else is done in excel.
13. March 2009 at 4:57 am :
we had a demo of omniture’s discover onpremise (formerly visual sciences), and the visualization tools are fairly amazing. it seems like an interesting solution for trending real time evolving data, but we aren’t pulling the trigger on it now.
-
For reference PDL (Perl Data Language) can be found at pdl.perl.org/
and is also available via CPAN
/I3az/
-
oops.. link screwed up… here goes again ;-)
pdl.perl.org
-
Have you seen Resolver One?
It’s a spreadsheet like Excel, but has built-in Python support, and allows cells
in the grid to hold objects. This means that numpy mostly works, and you can
have one cell in the grid hold a complete dataset, then manipulate that dataset
in bulk using spreadsheet-like formulae. Someone has also just built an extension that
allows you to connect it to R, too. In theory, this means that you can get
the best of all three — spreadsheet, numpy, and R — in your model, using the
right tool for each job.
On the other hand, the integration with both numpy and R is quite new, so it’s immature as a stats tool compared to the other packages in this list.
Full transparency: I work for Resolver Systems, so obviously I’m biased towards it :-) Still, we’re very keen on feedback, and we’re happy to give out free copies for non-commercial research and for open source projects.
-
Being the resident MATLAB enthusiast in a house built on another tool, I will pitch in my two cents by suggesting another spectrum along which these tools lie: “canned procedures” versus “roll your own”. General-purpose programming languages, such as the Fortran or C/C++ suggested in the comments, clearly anchor one end of this dimension, whereas the statistical software sporting canned routines lies all the way at the other. A tool like MATLAB, which provides some but not complete direct statistical support, is somewhere in the middle. The trade-off here, naturally, is the ability to customize analysis vs. convenience.
-
Jude Ryan wrote:Most of the users on this post are biased towards packages like R rather than packages like SAS, and I want to offer my perspective on the relative advantages and disadvantages of SAS relative to R.
16. March 2009 at 4:37 pm :
I am primarily a SAS user (over 20 years) who has been using R as needed (a few years) to do things that SAS cannot do (like MARS splines), or cannot do as well (like exploratory data analysis and graphics), or requires expensive SAS products like Enterprise Miner to do (like decision trees, neural networks, etc).
I have worked primarily for financial service (credit cards) companies. SAS is the primary statistical analysis tool in these companies partly due to history (S, the precursor to S+ and R, was not yet developed) and partly because it can run on mainframes (another legacy system), accessing huge amounts of data stored on tapes, which I am not sure any other statistical package can. Furthermore, businesses that have the $ will be the last to embrace open source software like R, as they generally require quick support when they get stuck trying to solve a business problem, and researching the problem in a language like R is generally not an option in a business setting.
Also, SAS’ capabilities for handling large volumes of data are unmatched. I have read huge compressed files of online data (DoubleClick), having over 2 billion records, using SAS, to filter the data and keep only the records I needed. Each of the resulting SAS datasets was anywhere from 35 GB to 60 GB in size. As far as I know, no other statistical tool can process such large volumes of data programmatically. First we had to be able to read in the data and understand it. Sampling the data for modeling purposes came later. I would run the SAS program overnight, and it would generally take anywhere from 6 to 12 hours to complete, depending on the load on the server. In theory, any statistical software that works with records one at a time should be able to process such large volumes of data, and maybe the Python based tools can do this. I do not know, as I have never used them. But I do know that R, and even tools like WEKA, cannot process such volumes of data. Reading the data from a database, using R, can mitigate the large data problems encountered in R (as does using packages like biglm), but SAS is the clear leader in handling large volumes of data.
R on the other hand is better suited for academics and research, as cutting edge methodologies can be and are implemented much more rapidly in R than in SAS, since R’s programming language has more elegant support for vectors and matrices than SAS (proc IML). R’s programming language is much more elegant and logically consistent, while SAS’ programming languages are more ad hoc, with non-standard programming constructs. Furthermore, people who prefer R generally have a stronger “theoretical” programming background (most have programmed in C, Perl, or object-oriented languages) or are able to pick up programming faster, while most users who feel comfortable with SAS have less of a programming background and can tolerate many of SAS’ non-standard programming constructs and inconsistencies. These people do not require or need a comprehensive programming language to accomplish their tasks, and it takes much less effort to program in base SAS than in R if one has no “theoretical” programming background. SAS macros take more time to learn, and many programming languages have no equivalent (one exception I know of is C’s pre-processor commands). But languages like R do not need anything like SAS macros and can achieve the same results all in one, logically consistent, programming language, and do more, like enabling R users to write their own functions. The equivalent in SAS to writing functions in R is to program a new proc in C and know how to integrate it with SAS - an extremely steep learning curve. SAS is more of a suite of products, many of them with inconsistent programming constructs (base SAS is totally different from SCL - formerly Screen Control Language but now SAS Component Language), and proc SQL and proc IML are different from data step programming.
So while SAS has a shallow learning curve initially (learn only base SAS), the user can only accomplish tasks of “limited” sophistication with SAS without resorting to proc IML (which is quite ugly). For the business world this is generally adequate. R, on the other hand, has a steeper learning curve initially, but tasks of much greater sophistication can be handled more easily in R than in SAS, once R’s steeper learning curve is behind you.
I foresee increased use of R relative to SAS over time, as many statistics departments at universities have started teaching R (sometimes replacing SAS with R), and students graduating from these universities will be more conversant with R, or equally conversant with both SAS and R. Many of these students entering the workforce will gravitate towards R, and to the extent the companies they work for do not mandate which statistical software to use, the use of R is bound to increase over time. With memory becoming cheaper, and Microsoft-based 64-bit operating systems becoming more prevalent, bigger data sets can be stored in RAM, and R’s limitations in handling large volumes of data are starting to matter less. But the amount of data is also starting to grow, thanks to the internet, scanners (used in grocery chains), etc., and the volume of data may very well grow so rapidly that even cheaper RAM and 64-bit operating systems may not be able to cope with the data deluge. But not every organization works with such large datasets.
For someone who has started their career using SAS, SAS is more than adequate to solve all problems faced in the business world, and there may seem to be no real reason, or even justification, to learn packages like R or other statistical tools. To learn R, I have put in much personal time and effort, and I do like R and foresee using it more frequently over time for exploratory data analysis, and in areas where I want to implement cutting edge methodologies and where I am not hampered by large data issues. Personally, both SAS and R will always be part of my “tool kit” and I will leverage the strengths of both. For those who do not currently use R, it would be wise to start doing so, as R is going to be more widely used over time. The number of R users has already reached critical mass, and since R is free, this is bound to increase the usage of R as the R community grows. Furthermore, the R Help Digest, and the incredibly talented R users who support it, is an invaluable aid to anyone interested in learning R.
-
Interesting. I don’t think I would have put SPSS and Stata in the same
category. I haven’t spent a tremendous amount of time working with SPSS, but I
have spent a fair amount of time with Stata, and my biased perspective is that
Stata is more sophisticated and powerful than SPSS. Certainly, Stata’s language
isn’t as powerful as R’s, but I definitely wouldn’t say it’s “weak.” Stata’s not
my favorite statistical program in the world (that would, of course, be R), but
there are definitely things I like about it; it’s a definite second to R in my
book.
By the way, here’s my (unfair) generalization regarding usage:
– R: academic statisticians
– SAS: statisticians and data-y people in non-academic settings, plus health scientists in academic and non-academic settings
– SPSS: social scientists
– Stata: health scientists
-
xin wrote:Sean:
19. April 2009 at 2:03 am :
I am a junior SAS user with only 3 years’ experience. But even I know that you need to press ‘Ctrl’ + ‘F’ to search for ‘uninitialized’ and ‘more than’ in the SAS log to ensure everything is OK.
As far as the couple of C++ PhDs in your group are concerned, they need to understand how to play by the rules of whatever system they are using…
-
xin wrote:by the way, I find the comments the SAS people left are more tolerant and open-minded (maybe they are older, lol). Instead, the majority of ‘R’ers on this thread act like a bunch of rebellious teens…
19. April 2009 at 2:07 am :
-
Joe wrote:I am a big fan of Stata over SAS for medium and small businesses. SAS is the Mercedes-Benz of stats, I’ll admit, for government and big business. I use Stata a LOT for economics; it has all the most-used predictive methods (OLS, MLE, GLS, 2SLS, binary choice, etc.) built in. I think a model would have to be pretty esoteric not to be found in Stata.
30. April 2009 at 6:58 pm :
I ran Stata on a Linux server with 16 GB of RAM and about 2 TB of disk storage. The hardware config was about $12K. I would not recommend using virtual memory for Stata. That said, you can stick a lot of data in 16 GB of RAM! If I paid attention to the variable sizes (keeping textual ones out), I could get hundreds of millions of rows into memory.
Stata supports scripting (do-files), which is very easy to use, as is the GUI. The GUI is probably the best feature.
The hardware ($12,000) + software ($3,000 - 2 user license) costs $15,000. The equivalent SAS software was about $100,000. You do the math.
I’ve used SPSS, but that was a while ago. At that time I felt Stata was the superior product.
-
Finally a direct Stata vs SAS comparison! Very interesting. Thanks for
posting. I can’t believe SAS = $100,000.
> I ran Stata on linux server with 16GB ram and about 2TB of disk storage.
> I would not recommend using virtual memory for Stata.
In my experience, virtual memory is *always* a bad idea. I remember working with ops guys who would consider a server as good as dead once it started using swap.
All programs that effectively use hard disks always have custom code to control when to move data on and off the disk. Disk seeks and reads are just too slow and cumbersome compared to RAM to have the OS try to automatically handle it.
This would be my guess why SAS handles on-disk data so well - they put a lot of engineering work into supporting that feature. Same for SQL databases, data warehouses, and inverted text indexes. (Or the widespread popularity of Memcached among web engineers.) R, Matlab, Stata and the rest were originally written for in-memory data and still work pretty much only in that setting.
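The point about explicitly controlling data movement is easy to see in code. A streaming mean over a file far larger than RAM, one block at a time (plain numpy; the file name and dtype are assumptions):

    import numpy as np

    total, count = 0.0, 0
    with open('huge.dat', 'rb') as f:
        while True:
            # Pull one block (a million float64s, ~8 MB) off disk at a time.
            block = np.fromfile(f, dtype='float64', count=1000000)
            if block.size == 0:
                break
            total += block.sum()
            count += block.size

    print(total / count)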
-
And also, on the RAM vs hard disk issue — according to Jude Ryan’s very
interesting comment above, SAS has a heritage of working with datasets on *tape*
drives. Tape, of course, is even further along the size-vs-latency spectrum than
RAM or hard disk. Now hard disk sizes are rapidly growing but seek times are not
catching up, so people like to say “hard disk is the new tape” — therefore, if
your software was originally designed for tape, it may do best! :)
-
Here’s an overly detailed comparison of Stata, SAS, and SPSS. Basically no
coverage of R beyond the complaint that it’s too hard.
http://www.ats.ucla.edu/stat/technicalreports/
There’s also an interesting reply from Patrick Burns, defending R and comparing it to those 3.
http://www.ats.ucla.edu/stat/technicalreports/Number1/R_relative_statpack.pdf
(Found linked from a comment on John D. Cook’s blog here:
http://www.johndcook.com/blog/2009/05/01/r-the-good-parts/ )
-
Jaime wrote:I feel so old. Been using SAS for many years. But what the hell is this R ?????? That’s what the kids are using now?
27. May 2009 at 9:37 pm :
-
Gye Greene wrote:Great comparison of SPSS, SAS, and Stata by Acock (a summary of his findings here — http://www.ocair.org/files/KnowledgeBase/willard/StatPgmEvalAb.pdf)
28. May 2009 at 4:54 am :
Below is a summary of the summary — !!! — with my own observations added on.
SAS: Scripting language is awkward, but it’s great for manipulating complex data structures; folks that analyze relational DBs (e.g. govt. folks) tend to use it.
SPSS: Great for the “weekend warriors”; strongly GUI-based; has a scripting language, but it’s inelegant. They charge a license fee for **each** “module” (e.g. correlations? linear regressions? Poisson regressions? A separate fee!). Also, they charge an annual license. Can read Excel files directly. Used to have nicer graphs and charts than Stata (but, see below).
Stata: Elegant, short-’n'-punchy scripting language; CLI- and script-oriented, but also allows GUI. Strong user base, with user-written add-ons available for download. **Excellent** tech support! The most recent version (Stata 10) now has some pretty powerful chart/graph editing options (GUI, plus CLI, your choice) that make it competitive with the SPSS graphs. (Minor annoyance: every few versions, they make the data format NOT back-compatible with the previous version — you have to remember to “Save As” last year’s version, or else what you save at work won’t open at home…)
My background: Took a course on SAS, but haven’t had a reason to use it. I’ve used SPSS and Stata both, on a reasonably regular basis: I currently teach “Intro to Methods” courses with SPSS, but use Stata for my own work. I dislike how SPSS handles missing values. Unlike SPSS, Stata sells a one-time license: once you buy a version, it’s yours to keep until you feel it’s too obsolete to use.
–GG
-
Gye Greene wrote:This may be an unfair generalization, but my personal observation is that SPSS users (within the social sciences, at least) tend to have less quantitative training than Stata users. Probably highly correlated with the GUI vs. CLI orientations of the two packages (although each of them allows for both).
28. May 2009 at 1:53 pm :
Another way of differentiating between various statistical software packages is their Geek Cred. I usually tell my Intro to Research Methods class (for the social sciences) that…
(On a scale of 0-10…)
R, Matlab, etc. = 9
SAS = 7
Stata = 5
SPSS = 3
Excel = 2
YMMV. :)
COMMENT ON EXCEL: It’s a spreadsheet, first and foremost — so it doesn’t treat rows (cases) as “locked together”, like statistical software does. Thus, when you highlight a column and ask it to sort, it sorts **only** that column. I got burned by this once, back in my first year of grad school, T.A.-ing: sorted HW #1 scores (out of curiosity), and didn’t notice that the rest of the scores had stayed put. Oops.
I now keep my gradebooks in Stata. :)
–GG
-
Chuck Moore wrote:I began programming in SAS every day at a financial exchange in 1995. SAS has three main benefits over all other Statistical/Data Analysis packages, as far as I know.
29. May 2009 at 1:29 pm :
1) Data size = truly unlimited. I learned to span 6 DASD (Direct Access Storage Devices = disk drives) on the mainframe for when I was processing > 100 million records = quotes and trading activity from all exchanges. When we went to Unix, we used 100 GB worth of temp “WORK” space, and were processing > 1 billion transactions a day in < 1 hour (IBM p630 with 4x 1.45 GHz processors and 32 GB of memory; the processing actually used < 4 GB).
2) Tons and tons of preprogrammed statistical functions with just about every option possible.
3) SAS can read data from almost anything: tapes, disk, etc.; fixed-field flat files, delimited text files (any delimiters, not just comma or tab or space), xml, most any database, all mainframe data file types. It also translates most any text value into data, and supports custom input and output formats.
SAS is difficult for most real programmers (I took my first programming class in 1977, and have programmed in more languages than I care to share) because it has a data-centric perspective as opposed to a machine/control-centric one. It is meant to simplify the processing of large amounts of data for non-programmers.
SAS used to have incredible documentation and support, at incredibly reasonable prices. Unfortunately, the new generation of programmers and product managers have lost their way, and I agree that SAS has been becoming a beast.
For ad hoc work, I immediately fell in love with SAS/EG = Enterprise Guide. Unfortunately, EG is written in .NET and is not that well written. I would have preferred it to be written in Java so that the interface was more portable and supported a better threading model. Oh well.
One of the better features of SAS is that it is not an interpreted programming language; from the start in 197? it was JIT. Basically, a block of code is read, compiled, and then executed. This is why it is so efficient at processing huge amounts of data. The concept of the “data step” does allow for some built-in inefficiencies from the standpoint of multiple passes through the data, but that is the price of SAS’s convenience. A C programmer would have done more things, in fewer passes, but the C programmer would have spent many more hours writing the program than the few minutes SAS takes to do the same thing. I know this because I’ve done it.
Someplace I read a complaint about SAS holding only one observation in memory at a time. That is a gross misunderstanding/mistake. SAS holds one or more blocks of observations (records) in memory at a time. The number held is easily configurable. Each observation can be randomly accessed, whether in memory or not.
SAS 9.2 finally fixes one of the bigger complaints, with PROC FCMP allowing the creation of custom functions. Originally SAS did not support custom functions; SAS wanted to write them for you.
The most unfortunate thing about SAS currently is that it has such a long legacy on uniprocessor machines that it is having difficulty getting going in the SMP world and properly taking advantage of multi-threading and multi-processing. I believe this is due to lack of proper technical vision and leadership. As such, I believe a Java-language HPC derivative and tools will eventually take over, providing superior ease of use, visualization, portability, and processing speed on today’s servers and clusters. Since most data will come from an RDBMS these days, flat file input won’t carry enough weight.
But, for my current profession = Capacity Planning for computer systems, you still can’t beat SAS + Excel. On the other hand, it looks like I’m going to have to look into R.
-
Chuck Moore wrote:On a side note. As a “real” programmer, having been an expert in Pascal and C and having programmed in, oh I don’t want to list them all, but I have also done more than just take classes in Java. Anyway, Macros have a place in programming. There have been a few times I wished Java supported macros and not just assertions, out of my own laziness. I am a firm believer in the right tool for the job, and that not everything is a nail, so I need more than just a hammer. The unfortunate thing is that macros can be abused, just like goto’s and programming labels and global variables.
29. May 2009 at 1:47 pm :
To me, SAS is/was the greatest data processing language/system on the planet. But, I still also program in Java, C, ksh, VBScript, Perl, etc. as appropriate. I’d like to see someone do an ARIMA forecast in Excel, or run a regression that does outlier elimination in only 3 lines of code!
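Not SAS, but for what it’s worth, the outlier-elimination quip is nearly achievable in numpy as well. A rough sketch (simulated data and a single trim pass, not a proper robust procedure):

    import numpy as np

    # Simulated line with a few planted outliers.
    x = np.random.rand(200)
    y = 2.0 * x + 0.1 * np.random.randn(200)
    y[::50] += 5.0

    A = np.vstack([x, np.ones_like(x)]).T
    coef = np.linalg.lstsq(A, y, rcond=None)[0]               # first fit
    keep = np.abs(y - A.dot(coef)) < 2 * (y - A.dot(coef)).std()
    coef2 = np.linalg.lstsq(A[keep], y[keep], rcond=None)[0]  # refit without outliers

    print(coef, coef2)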
-
If your dataset can’t fit on a single hard drive and you need a
cluster, none of the above will work.
One thing you have to consider is that using SciPy, you get all of the Python libraries for free. That includes the Apache Hadoop code, if you choose to use that. And as someone above pointed out, there is now parallel processing built right into the most recent distributions (but I have no personal knowledge of that) for MPI or whatever.
Coming from an engineer in industry (not academia), the really neat thing that I like about SciPy is the ease of creating web-based tools (as in, deployed to a web server for others to use) via deployment on an Apache installation and mod_python. If you can get other engineers using your analysis without sending them an Excel spreadsheet, or a .m file (for which they need a Matlab license), etc., it makes your work much more visible.
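A sketch of what that looks like with mod_python (the handler interface is mod_python’s; the analysis itself is a stand-in):

    import numpy as np
    from mod_python import apache

    def handler(req):
        # Apache calls this for each request mapped to the module.
        req.content_type = 'text/plain'
        x = np.random.randn(1000)   # stand-in for a real analysis
        req.write('mean=%.3f sd=%.3f\n' % (x.mean(), x.std()))
        return apache.OK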
-
sohan wrote:hello everyone…
14. June 2009 at 10:30 am :
i want to know about comparative studies of SAS, R, and SPSS for data analysis.
can anyone point me to papers on this?
-
ed wrote:having used sas, spss, matlab, gauss and r, let me say that describing stata as having a weak programming language is a sign of ignorance.
18. June 2009 at 11:28 am :
it has a very powerful interpreted scripting language which allows one to easily extend stata. there is a very active community and many user written add-ons are available. see: http://ideas.repec.org/s/boc/bocode.html
stata also has a full-fledged matrix programming language called mata, comparable to matlab, with a c-like syntax, which is compiled and therefore very fast.
managing and preparing data for analysis is a breeze in stata.
finally stata is easy to learn.
obviously not many people use stata around here.
some more biased opinions:
sas is handy if you have some old punch cards in the cupboard or a huge dataset. apart from that it truly sucks. some people say that it is good for managing data, but why not use a good relational database to do that and then use decent statistical software to do the analysis?
excel obviously sucks infinitely more than sas. apart from its (lack of) statistical capabilities and reliability, any point-and-click-only software is an obvious no-no from the point of view of scientific reproducibility.
i don’t care for spss and cannot imagine anyone does.
matlab is nice, but expensive. not so great for preparing/managing data.
have not used scipy/numpy myself, but have colleagues who love it. one big advantage is that it uses python (ie a good language to master and use)
r is great, but more difficult to get into. i don’t like the loose syntax too much though. it is also a bitch with big datasets.
-
Willem wrote:On high-quality graphics in R, one should certainly check out the Cairo package. Many graphics can be output in hip formats like SVG.
17. July 2009 at 6:53 am :
-
On the point of Excel breaking down at 10,000+ rows, apparently Excel 2010
will come with Gemini, an add-on developed by the Excel and SQL team, aiming at
handling large datasets:
Project Gemini sneak preview
I doubt this would make Excel the platform of choice for doing anything fancy with large datasets anyways, but I am intrigued.
-
Jay Verkuilen wrote:Some reax, as I’ve used most of these at some point:
26. July 2009 at 9:48 pm :
SAS has great support for large files even on a modest machine. A few years ago I ran a bunch of sims for my dissertation using it, and it worked happily away without so much as batting an eyelash on a crappy four-year-old Windoze XP machine with 1.5 GB of memory. Also, programs like NLP (nonlinear optimization), NLMIXED, MIXED, and GLIMMIX are really great for various mixed model applications—this is quite broad, as many common models can be cast in the mixed model framework. NLMIXED in particular lets you write some pretty interesting models that would otherwise require special coding. Documentation in SAS/STAT is really solid and their tech support is great. Graphics suck and I don’t like the various attempts at a GUI.
I prefer Stata for most “everyday” statistical analysis. Don’t knock that, as it’s pretty common even for a methodologist such as myself to need to fit logistic regression or whatever and not want to have to waste a lot of time on it, which Stata is fantastic for. Stata 11 looks to be even better, as it incorporates procedures such as Multiple Imputation easily. The sheer amount of time spent doing MI followed by logistic regression (or whatever) is irritating. Stata speeds that up. Also when you own Stata you own it all and the upgrade pricing is quite reasonable. Tech support is also solid.
SPSS has a few gems in its otherwise incomprehensible mass of utter bilge. IMO it’s a company with highly predatory licensing, too.
R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility. I like it for class because it’s free, there are nice books for it, and it lets me bypass IT as it’s possible to put a working R system on a USB drive. I love the graphics.
Matlab has made real strides as a programming language and has superb numerics in it (or did), at least according to the numerics people I know (including my numerical analysis professor). However, Statistics Toolbox is iffy in terms of what procedures it supports, though it might have been updated. Graphics are also nice. But it is expensive.
Mathematica is nice for symbolic calculation. With the MathStatica add-on (sadly this has been delayed for an unconscionable amount of time) it’s possible to do quite sophisticated theoretical computations. It’s not a replacement for your theoretical knowledge, but it is very helpful for doing all the tedious, error-prone calculations necessary.
-
Brett D wrote:I started in Matlab, moved on to R, looked at Octave, and am just getting into SciPy.
27. July 2009 at 10:58 am :
Matlab is good for linear algebra and related multivariate stats. I could never get any nice plotting out of it. It can do plenty of things I never learnt about, but I can’t afford to buy it, so I can’t use it now anyway.
R is powerful, but can be very awkward. It can write jpeg, png, and pdf files, make 3D plots and nice 2D plots as well. Two things put me off it: it’s an absolute dog to debug (how does “duplicate row names are not allowed” help as an entire error message when I’ve got 1000 lines of code spread between 4 functions?), and its data types have weird eccentricities that make programming difficult (like transposing a data frame turns it into a matrix, and using sapply to loop over something returns a data frame of factors… I hate factors). There are a lot of packages that can do some really nice things, although some have pretty thin documentation (that’s open source for you).
Octave is nicer to use than R ( = Matlab is nicer to use than R), but I found it lacking in most things I wanted to do, and the development team seem to wait for something to come out in Matlab before they’ll do it themselves, so they’re always one step behind someone else.
I’m surprised how quickly I’m picking up SciPy. It’s much easier to write, read and debug than R, and the code looks nicer. I haven’t done much plotting yet, but it looks promising. The only trick with Python is its assignments for mutable data types, which I’m still getting my head around.
-
Mike wrote (29. July 2009 at 9:45 pm): Mathematica is also able to link to R via a third-party add-on distributed by ScienceOps. The numeric capabilities of Mathematica were “ramped up” 6 years ago, so it should be thought of as more than a symbolic (only) environment. Further info here:
http://reference.wolfram.com/mathematica/note/SomeNotesOnInternalImplementation.html#28959
(I work for Wolfram Research)
-
> R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility.
Hah!
Everyone really likes Stata. Interesting.
-
I use Python/Matlab for most analysis, but Mathematica is really nice for building demos and custom visualization interfaces (and for debugging your formulas).
For instance, here’s an example of taking some mutual fund data and visualizing those mutual funds (from 3 different categories) in a Fisher Linear Discriminant transformed space (down to 3 dimensions from an initial 57 or so):
http://yaroslavvb.com/upload/strands/dim-reduce/dim-reduce.html
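For the curious, a rough equivalent of this idea in R via MASS::lda; the funds data frame and its category column are hypothetical, and note that with 3 classes Fisher’s method yields at most 2 discriminant axes:

    library(MASS)
    # funds: hypothetical data frame, ~57 numeric columns plus a 3-level factor
    fit  <- lda(category ~ ., data = funds)
    proj <- predict(fit)$x             # discriminant coordinates (2 columns here)
    plot(proj, col = funds$category)   # classes separate along these axes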
-
A post on R vs. Matlab: To R or not to R
-
Also, a discussion looking for solutions that are both fast to prototype and fast to execute: suitable functional language for scientific/statistical computing
-
Cristian wrote (1. September 2009 at 3:21 am): I do not understand why SAS is so much hailed here because it handles large datasets. I use Matlab almost exclusively in finance, and when I have problems with how large the data sets are, I don’t use SAS but a MySQL server instead. Matlab can talk to MySQL server, and thus I do not see why SAS is needed in this case.
-
Mike wrote (11. September 2009 at 6:38 am): I have used Stata and R but for my purposes I actually prefer and use Mathematica. Unsurprisingly nobody has discussed its use so I guess I will.
I work in ecology and I use Mathematica almost exclusively for modeling. I’ve found that the elegance of the programming language lends itself to easily using it for statistical analysis as well. Although it isn’t really a statistics package, being able to generate large amounts of data and then process them in the same place is extremely useful. To make up for the lack of built-in statistical analysis I’ve built my own package over time by collecting and refining the tests I’ve used.
For most people I would say using Mathematica for statistics is way more work than it is worth. Nevertheless, those who already use it for other things may find it is more than capable of performing almost any data analysis you can come up with using relatively little code. The addition of functionality targeted at statistics in versions 6 and 7 has made this use simpler, although the built-in ANOVA package is still awkward and poorly documented. One thing it and Matlab beat other packages at hands down is list/matrix manipulation, which can be extremely useful.
-
Paul Kim wrote (14. September 2009 at 9:10 pm): I am using MATLAB along with SPSS. Does anyone know how to connect SPSS with MATLAB? Or can we use any form of programming (e.g., “for” loops and “if”) in SPSS to connect with MATLAB?
Thank you.
Paul
-
Mattia wrote (25. September 2009 at 1:39 pm): I worked at the International Monetary Fund so I thought I’d add the government perspective, which is pretty much the same as the business one. You need software that solves the following equation:
maximize amount of useful output
such that: salaries of staff * hours worked - cost of software < budget
It turns out the IMF achieves that by letting every economist work with whatever they want. As a matter of fact, economists end up using Stata.
Consider that most economics datasets are smaller than 1Gb. Stata MultiProcessor will work comfortably with up to 4Gb on the available machines. Stata has everything you need for econometrics, including a matrix language that is just like Matlab and state of the art maximum likelihood optimization, so you can create your own “odd” statistical estimators. Programming has a steeper learning curve than Matlab but once you know the language it’s much more powerful, including very nice text data support and I/O (not quite python, but good enough). If you don’t need some of the fancy add-on packages that engineers use, like say “hydrodynamics simulation”, that’s all you need. But most importantly importing, massaging and cleaning data with Stata is so unbelievably efficient that every time I have to use another program I feel like I am walking knee-deep in mud.
So why do I have to use other programs, and which?
IMF has one copy of SAS that we use for big jobs, such as when I had 100Gb of data. I won’t dwell on this because it’s been covered above, but in general SAS is industrial-grade stuff. One big difference between SAS and other programs is that SAS will try to keep working when something goes wrong. If you *need* numbers for the next morning, you go to bed, the next morning you come and Stata has stopped working because of a mistake. SAS hasn’t, and perhaps your numbers are garbage, but if you are able to tell that they are simply 0.00001% off then you are in perfectly good shape to make a decision.
Occasionally I use Matlab or Gauss (yes, Gauss!) because I need to put the data through some black box written in that language and it would take too long to understand it and rewrite it.
That’s all folks. Thanks for the attention.
-
Mattia wrote (25. September 2009 at 6:42 pm): No, that was not all, I forgot one thing. Stata can map data using a free user-written add-in (spmap), so you can save yourself the time of learning some brainy GIS package. Does anyone know whether R, SAS, SPSS or other programs can do it?
-
R has some packages for plotting geo data, including “maps”, “mapdata”, and also some ggplot2 routines. Now I just saw an entire “R-GIS” project, so I’m sure there’s a lot more related stuff for R…
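A minimal sketch with the “maps” package, assuming a hypothetical data frame dat with lon/lat/value columns:

    library(maps)
    map("world")                                # country outlines
    points(dat$lon, dat$lat, cex = dat$value)   # scale point size by the variable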
-
Comparison of data analysis packages (R, Matlab, SciPy, Excel, SAS, SPSS, Stata) « دنیای پیرامون wrote (30. September 2009 at 6:59 am): [...] to see which one was more suitable, I started comparing. On a blog I found a simple and yet fairly deep comparison. I [...]
-
Tao Wu wrote (30. September 2009 at 5:56 pm): Hi, all. I think I should mention a C++-framework-based software package named ROOT; see http://root.cern.ch
You will see ROOT is definitely better than R.
-
Tao Wu wrote (30. September 2009 at 5:59 pm): As I can see, the syntax and grammar of R are really stupid. I cannot imagine that R, S, and S+ have been widely used by financial bodies. Furthermore, they are trying to claim they are very professional and very good at financial data analysis. I can predict that if they shift to ROOT (a real language with C++), they will see the power of data analysis.
-
xin (April 19) writes:
> the majority of ‘R’ers on this thread act like a bunch of rebellious teens …
Well spotted — I’ve been a rebellious teen for decades now.
-
Wei Zhang wrote (10. January 2010 at 10:35 am): People in my workplace, an economic research trust, love Stata. Economists love Stata and they ask newcomers to use Stata as well. R is discouraged in my workplace with excuses like “it is for statisticians”. Sigh~~~~
But!!! I keep using it and keep discovering new ways of using it. Now, I use the ‘dmsend’ function from the ‘twitteR’ package to inform me of the status of my time-consuming simulations while I am not in the office. It is just awesome that using R makes me feel bound by nothing.
BTW, does anyone know how to use R to send emails (on various OSes: Windows, Mac, Unix, Linux)? I googled a bit and found nothing very promising. Any plans to develop a package?
If we had the package, we could just hit ‘paste to console’ (RWinEdt) or C-c C-c (ESS+Emacs) and let R estimate, simulate and send results to co-authors automatically. What a beautiful world!!
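One possibility is the sendmailR package, assuming it is installed and you can reach an SMTP server; a minimal sketch (not tested on every OS, and the addresses and server are hypothetical):

    library(sendmailR)
    sendmail(from = "<r-job@example.org>",
             to = "<me@example.com>",
             subject = "simulation finished",
             msg = "Point estimates look sane; full results on the server.",
             control = list(smtpServer = "smtp.example.com"))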
I use Matlab and Stata as well but R completely owns me. Being a bad boy naturally, I have started to encourage newcomers in my workplace to use R.
-
ynte wrote (13. January 2010 at 8:30 pm): I happened to hit this page, and I am impressed by the pros and cons.
I have been using SPSS for over 30 years and I’ve been appreciating the steep increase in usability from punch-card syntax to pull-down menus. I only ran into R today because it can handle zero-inflated Poisson regression and SPSS can’t or won’t.
I think it is great to find open-source statistical software. I guess it requires a special mental framework to actually enjoy struggling through the command structure, but if I were 25 years younger………
It really is a bugger to find that SPSS (or whatever they like to be called) and R come up with different parameter estimates on the same dataset [at least in the negative binomial model I compared].
Is there anyone out there with experience in comparing two or more of these packages on one and the same dataset?
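For anyone attempting such a comparison in R, the negative binomial model is a one-liner with MASS; the data frame and variables below are hypothetical, and discrepancies with SPSS often trace back to different parameterizations of the dispersion or different convergence criteria:

    library(MASS)
    fit <- glm.nb(count ~ age + sex, data = mydata)  # hypothetical variables
    summary(fit)$coefficients  # compare these against SPSS's estimates
    fit$theta                  # dispersion; other packages may report 1/theta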
-
Wei wrote (16. January 2010 at 9:58 am): @ynte
Why don’t you join the R mailing list? If you ask questions properly there, you will get answers.
I would suggest a place to start: http://www.r-project.org/mail.html
Have fun.
-
peng wrote (27. January 2010 at 10:22 am): Hi friends,
I am new to R. I would like to know about R-PLUS. Does anyone know where I can get free training for R-PLUS?
Regards,
Peng.
-
Wayne wrote (12. February 2010 at 8:38 pm): I use R.
I’ve looked at Matlab, but the primitive nature of its language turns my stomach. (I mean, here’s a language that uses alternating strings and values to imitate named parameters? A language where it’s not unusual to have a half page of code in a routine dedicated to filling in parameters based on the number of supplied arguments.) And the Matlab culture seems to favor Perlesque obfuscation of code as a value. Plus it’s expensive. It’s really an engineer’s tool, not a statistician’s tool.
SAS creeps me out: it was obviously designed for punched cards and it’s an inconsistent mix of 1950’s and 1960’s languages and batch command systems. I’m sure it’s powerful, and from what I’ve read the other statistics packages actually bend their results to match SAS’s, even when SAS’s results are arguably not good. So it’s the Gold Standard of Statistics ™, literally, but it’s not flexible and won’t be comfortable for someone expecting a well-designed language.
R’s language has a good design that has aged well. But it’s definitely open source: you have two graphical languages that come in the box (base and lattice), with a third that’s a real contender (ggplot2). Which to choose? There are over 2,000 packages and it takes a bit of analysis just to decide which of the four Wavelet packages you want to use for your project — not just current features, but how well maintained the package appears to be, etc.
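The same scatterplot in the three graphics systems just mentioned; a minimal sketch, assuming lattice and ggplot2 are installed:

    df <- data.frame(x = rnorm(50), y = rnorm(50))

    plot(y ~ x, data = df)                 # base graphics

    library(lattice)
    xyplot(y ~ x, data = df)               # lattice

    library(ggplot2)
    ggplot(df, aes(x, y)) + geom_point()   # ggplot2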
There are really three questions to answer here: 1) What field are you working in, 2) How focused are your needs, and 3) What’s your budget?
In engineering (and Machine Learning and Computer Vision), 95% of the example code you find in articles, online, and in repositories will be Matlab. I’ve done two graduate classes using R where Matlab was the “no brainer” choice, but I just can’t stomach Matlab “programming”. Python might’ve been a good choice as well, but with R I got an incredible range of graphics combined with a huge variety of statistical and learning techniques. You can get some of that in Python, but it’s really more of a general-purpose tool where you definitely have to roll your own.
-
[...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata – Brendan O… – Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis. I think most people choose one based on what people around them use or what they learn in school, so I’ve found it hard to find comparative information. I’m posting the table here in hopes of useful comments. [...]
-
Jay wrote (17. February 2010 at 9:43 am): Yeah, quite the odd list. If *Py stuff is in there, then PDL definitely should be too.
-
[...] Comparison of data analysis packages from Brendan O’Connor [...]
-
stat_stuff wrote (25. February 2010 at 10:24 am): I like what you wrote to describe SPSS, clear and concise… nuf said :-)
-
forkandwait wrote (27. February 2010 at 12:05 am): I would like to comment on SAS versus R versus Matlab/Octave.
SAS seems to excel at data handling, both with large datasets and with whacked proprietary formats (how else can you read a 60GB text file and merge it with an Access database from 1998?). It is really ugly though, not interactive/exploratory, and graphics aren’t great.
R is awesome because it is a fully featured language (things like named parameters, object orientation, typing), and because every new data analysis algorithm probably gets implemented in it first these days. I rather like the graphics. However, it is a mess, with naming conventions that have evolved badly over time, conflicting types, etc.
Matlab is awesome in its niche, which is NOT data analysis, but rather math modeling with scripts between 10 and 1000 lines. It is really easy to get up and running if you have a math (i.e. linear algebra) background, the function-file system is great for a medium level of software engineering, plotting is awesome and simpler than R, and the datatypes (structs) are complex enough but don’t involve the headaches of a “well developed” type system. If you are doing data management, GUI interaction, or dealing with categorical data, it might be best to use SQL/SAS or something else and export your data into matrices of numbers.
I would like numpy and friends, but ZERO BASED INDEXING IS NOT MATHEMATICAL.
Just my 2c
-
anlaystenheini wrote (16. April 2010 at 4:52 pm): This is a great compilation, thank you.
After working as an econometrics analyst for a while, mainly using Stata, I can tell the following about Stata:
Stata is relatively easy to get started with and to produce some graphics quickly (that’s what all the business people want: click click, here’s your PowerPoint presentation with lots of colourful graphics and no real content).
BUT if you want to automate things, and if you want to make Stata do things it isn’t capable of out of the box, it is pure pain!
The big problem is: on one hand, Stata has a scripting/command interface which is not very powerful and very, very inconsistent. On the other hand, Stata has a fully featured matrix-oriented programming language with C-like syntax, which, being C-like, is not very handy (C is old and not made for mathematics; the Matlab language is much more convenient), and which doesn’t work well with the rest of Stata (you have a superfluous level for interchanging data from one part to the other).
All together, programming Stata feels like persuading Stata:
Error messages are almost useless, the macro text expansion used in the scripting language is not very suitable for things that have to do with mathematics (text can’t calculate), and many other little things.
It is very inconsistent, sometimes very clumsy to handle, and has silly limitations like string expressions limited to 254 chars, like in the early 20th century.
So go with Stata for a little ad hoc statistics, but do not use it for more sophisticated stuff; in that case learn R!
-
George Wolfe wrote (19. April 2010 at 11:13 pm): I’ve used Mathematica as a general-purpose programming language for the past couple of years. I’ve built a portfolio optimizer, various tools to manipulate data and databases, and a lot of statistics and graphing routines. People who use commercial portfolio optimizers are always surprised at how fast the Mathematica optimizations run - faster than their own optimizers. Based on my experience, I can say that Mathematica is great for numerical and ordinary computational tasks.
I did have to spend a lot of time learning how to think in Mathematica - it’s most powerful when used as a functional language, and I was a procedural programmer. However, if you want to use a procedural programming approach, Mathematica supports that.
Regarding some of the other topics discussed above: (1) Mathematica has built-in support for parallel computing, and can be run on supercomputing clusters (Wolfram Alpha is written in Mathematica). (2) The language is highly evolved and is being actively extended and improved every year. It seems to be in an exponential phase of development currently - Stephen Wolfram outlines the development plans every year at the annual user conference - and his expectations seem to be pretty much on target. (3) Wolfram has a stated goal of making Mathematica a universal computing platform which smoothly integrates theoretical and applied mathematics with general-purpose programming, graphics, and computation. I admit to a major case of hero worship, but I think he is achieving this goal.
I’m going on and on about Mathematica because, in spite of its wonderfulness, it doesn’t seem to have taken its rightful place in these discussions. Maybe Mathematica users drop out of the “what’s the best language for x” discussion after they start using it. I don’t know, really. But anyway, that’s the way I see it.
-
Dale wrote (25. April 2010 at 12:54 am): I am amazed that nobody has mentioned JMP. It is essentially equivalent to SPSS or Stata in capabilities but far easier to use (certainly to teach or learn). The main reason why it is not so well known is that it is a SAS product and they don’t want to market it well for fear that nobody will want SAS any more.
-
ad wrote (25. April 2010 at 1:23 pm): In the comparison I did not see FreeMat. This is an open-source tool that follows along the lines of MATLAB. It would be interesting to see how the community compares FreeMat to Matlab.
-
Farhat wrote (27. April 2010 at 9:37 am): @Wolfe: I have used Mathematica a lot over the past 8 years and still use it for testing ideas, as small pieces of code can do fairly sophisticated stuff, but I’ve found it poor for large datasets and longer code development. It even lacked things like support for a code versioning system until recently. The cost is also a major detractor: Mathematica costs like $2500 or so, last time I checked. Also, some of the newer features like Manipulate seem to create issues; I had a small piece of code using that for interactivity which sent the CPU usage to 100% regardless of whether any change was happening or not.
Also, SAGE (http://www.sagemath.org), the open-source alternative to Mathematica, has gotten quite powerful in the last few years.
-
I just wanted to mention that Maple, which has not been commented on yet in this post or in the subsequent thread, generates beautiful visuals, and I used to program in it all the time (as an alternative to Mathematica, which was used by the “other camp” and I wouldn’t touch).
Also, I’m starting to use Matlab now and loving how intuitive it is (for someone with programming experience, anyway).
-
Jason wrote (9. May 2010 at 5:40 pm): Let me quote some of Ross Ihaka’s reflections on R’s efficiency…
“I’m one of the two originators of R. After reading Jan’s
paper I wrote to him and said I thought it was interesting
that he was choosing to jump from Lisp to R at the same
time I was jumping from R to Common Lisp……
We started work on R in the early ’90s. At the time
decent Lisp implementations required much more resources
than our target machines had. We therefore wrote a small
scheme-like interpreter and implemented over that.
Being rank amateurs we didn’t do a great job of the
implementation and the semantics of the S language which
we borrowed also don’t lead to efficiency (there is a
lot of copying of big objects).
R is now being applied to much bigger problems than we
ever anticipated and efficiency is a real issue. What
we’re looking at now is implementing a thin syntax over
Common Lisp. The reason for this is that while Lisp is
great for programming it is not good for carrying out
interactive data analysis. That requires a mindset better
expressed by standard math notation. We do plan to make
the syntax thin enough that it is possible to still work
at the Lisp level. (I believe that the use of Lisp syntax
was partially responsible for why XLispStat failed to gain
a large user community).
The payoff (we hope) will be much greater flexibility and
a big boost in performance (we are working with SBCL so
we gain from compilation). For some simple calculations
we are seeing orders of magnitude increases in performance
over R, and quite big gains over Python…..”
the full post is here:
http://r.789695.n4.nabble.com/Ross-Ihaka-s-reflections-on-Common-Lisp-and-R-td920197.html#a920197
It is quite interesting to note that such a “provocative” post from one of R’s originators got zero response on the R-devel list…
-
Business Intelligence Tools: looking at R as a platform for big BI. - SkriptFounders wrote (23. May 2010 at 5:36 am): [...] is some more information I thought was nice on the best packages for stat analysis. The only thing that’s wrong here is the [...]
-
Sam wrote (16. June 2010 at 4:12 pm): I came across this thread and I’m finding the comments very useful. Thanks to all!
I’m trying to decide which software package to use. I’m a researcher working with clinical (patient-related) data. I have data sets with <10,000 rows (usually just a few thousand). I need software that will do multivariate and logistic regression and generate Kaplan-Meier survival curves. Visualization is very important.
Of note, I’m an avid programmer as a hobby (C++, assembly, most anything), so I’m very comfortable with a more complex package, but I need something that just works. I’ve been using SPSS, which works, but it’s clunky.
Any suggestions? Stata? Systat? S-Plus? Maple?
-
I still haven’t used Stata, but its users have very strong praise for it, for situations that sound like yours. That might be the best option to start with.
R might be worth trying too.
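If you do try R, both of your tasks are short; a minimal sketch, with the patients data frame and all variable names hypothetical:

    # logistic regression
    fit <- glm(died ~ age + stage, data = patients, family = binomial)
    summary(fit)

    # Kaplan-Meier curves by treatment group
    library(survival)
    km <- survfit(Surv(time, status) ~ treatment, data = patients)
    plot(km, col = 1:2, xlab = "Days", ylab = "Survival")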
-
Rashad wrote (5. July 2010 at 12:48 am): I am working on my undergraduate degree in statistics in the SAS direction, which has surprised people I meet in the field. The choice was somewhat arbitrary; I just wanted something applied to complement a pure mathematics degree. This post has opened many (…..many) options to consider. Thanks for the great discussion.
-
Donovan wrote (2. August 2010 at 2:44 am): So my question here is simple:
After you peel back all the layers and look at the solution that would require the least effort, with the most power and the greatest flexibility, why would anyone choose anything other than RPy first, with the language du jour your employer uses second as a backup, and scrap the code wars?
I mean, for my money, you make sure you can build a model in Excel, learn RPy & C#, and search for APIs if you need to use other languages, or just plain partner with someone who can code C++ {if you can’t} and simply inject it.
I mean, I plan on learning Java, PHP and SAS as well, but that is really a personal choice. Coming from IT within Finance, not knowing Java and SAS means you either won’t get in the door or will reach a glass ceiling pretty quickly unless you play corporate politics really, really well. So for me, it is a necessity. But the flip side is, wanting to make the leap into Financial Engineering after completing a doctorate in Engineering, RPy has also become a near necessity. Realistically, unless you just like coding, I have to say that what I have suggested makes the most sense for the average analysis pro. But then a lot of this is based upon whether you’re a Quant Researcher, Quant Developer, Analyst, etc. — different tools for different functions.
Just a thought.
-
Mark Smith wrote (14. August 2010 at 11:06 pm): SAS and R:
1. there is a book out on the topic (http://www.amazon.com/gp/product/1420070576?ie=UTF8&tag=sasandrblog-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1420070576)
2. R interface available in SAS 9.2
“While SAS is committed to providing the new statistical methodologies that the marketplace demands and will deliver new work more quickly with a recent decoupling of the analytical product releases from Base SAS, a commercial software vendor can only put out new work so fast. And never as fast as a professor and a grad student writing an academic implementation of brand-new methodology.
Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers.
“We know a lot of our users have both R and SAS in their toolkit, and we decided to make it easier for them to access R by making it available in the SAS 9.2 environment,” said Rodriguez.
The SAS/IML Studio interface allows you to integrate R functionality with SAS/IML or SAS programs. You can also exchange data between SAS and R as data sets or matrices.
“This is just the first step,” said Radhika Kulkarni, Vice President of Advanced Analytics. “We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. In the future, users will be able to interface with R through the IML procedure.“
http://support.sas.com/rnd/app/studio/Rinterface2.html
While this is probably more for SAS users than R, I thought both camps might be interested in case you get coerced into using SAS one day… doesn’t mean you have to give up your experience with R.
-
Iskander wrote (26. August 2010 at 4:51 pm): I am also amazed how few people here have said anything about StatSoft Statistica. I’ve been using it for close to 6 years and don’t see any shortcomings at all. Consider this:
- full support of R
- fully scriptable, which means you can call DLLs written in whatever programming language you like, implementing things you didn’t find built into Statistica (which doesn’t mean it’s not there)
- the Statistica solver / engine can be called externally from Excel and other applications via the COM/OLE interface
- untrammelled graphics of virtually any complexity — extremely flexible and customizable (and scriptable)
- the Data Miner (with its brand new ‘Data Miner Recipes’) is another extremely powerful tool that leaves only your imagination to limit you
….it would be tedious to list all its advantages (again, the Statistica Neural Networks and the Six Sigma modules are IMO very professionally implemented).
-
ZZ wrote (31. August 2010 at 12:08 pm): After SAS bought Teragram a few years ago, no package other than SAS can load unstructured data like the blog posts here and analyze and extract the sentiment (positive, negative, neutral) about each of the packages debated here with pretty decent precision.
-
[...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata – Brendan O’… Excellent comparison between data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata. (tags: python r matlab) [...]
-
Interesting Comparison of data analysis packages - CCPR Computing wrote (23. September 2010 at 10:31 pm): [...] http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-st... [...]
-
[...] 1. Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata [...]
-
John wrote (17. October 2010 at 4:40 am): A post above commented: “sas is handy if you have some old punch cards in the cupboard or a huge dataset. apart from that it truly sucks. some people say that it is good to manage data, but why not use a good relational database to do that and then use decent statistical software to do the analysis?” A good relational database is good at supporting online transactional processing and will in most organizations come with a bureaucracy of gatekeepers whose role is to ensure the integrity of the database to support mission-critical transactional applications. In other words, it takes a mountain of paperwork to merely add one field to a table. The paradigm assumes a business area of ‘users’ who have their requirements spelled out before anyone even thinks of designing, let alone programming, anything. It just kills analysis. Where SAS is used, data must be extracted from such systems and loaded into text files for SAS to read, or SAS/Access used. Generally DBAs are loath to install the latter, as it is difficult to optimize in the sense of minimizing the drain on operational systems.
On IBM mainframes the choice of languages to use is limited, and by default this will usually be SAS. Most large organisations have SAS, at least Base SAS, installed by default because the Merrill MXG capacity planning software uses it. Hence cost is sort of irrelevant. It then tends to be used for anything requiring processing of text files, even in production applications, and this often means processing text as text, e.g. JCL with date-dependent parameters, rather than preparing data for loading into SAS datasets for statistical analysis.
I know nothing about R, but seeing a few code samples it struck me how it resembles APL, to which we were introduced in our stats course in college in the early 70s; not surprising, as both are matrix oriented.
Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata
Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis. I think most people choose one based on what people around them use or what they learn in school, so I’ve found it hard to find comparative information. I’m posting the table here in hopes of useful comments.

Name | Advantages | Disadvantages | Open source? | Typical users
R | Library support; visualization | Steep learning curve | Yes | Finance; Statistics
Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering
SciPy/NumPy/Matplotlib | Python (general-purpose programming language) | Immature | Yes | Engineering
Excel | Easy; visual; flexible | Large datasets | No | Business
SAS | Large datasets | Expensive; outdated programming language | No | Business; Government
Stata | Easy statistical analysis | | No | Science
SPSS | Like Stata but more expensive and worse | | No | Science
There’s a bunch more to be said for every cell. Among other things:
- Two big divisions on the table: The more programming-oriented solutions are R, Matlab, and Python. More analytic solutions are Excel, SAS, Stata, and SPSS.
- Python “immature”: matplotlib, numpy, and scipy are all separate libraries that don’t always get along. Why does matplotlib come with “pylab” which is supposed to be a unified namespace for everything? Isn’t scipy supposed to do that? Why is there duplication between numpy and scipy (e.g. numpy.linalg vs. scipy.linalg)? And then there’s package compatibility version hell. You can use SAGE or Enthought but neither is standard (yet). In terms of functionality and approach, SciPy is closest to Matlab, but it feels much less mature.
- Matlab’s language is certainly weak. It sometimes doesn’t seem to be much more than a scripting language wrapping the matrix libraries. Python is clearly better on most counts. R’s is surprisingly good (Scheme-derived, smart use of named args, etc.) if you can get past the bizarre language constructs and weird functions in the standard library. Everyone says SAS is very bad.
- Matlab is the best for developing new mathematical algorithms. Very popular in machine learning.
- I’ve never used the Matlab Statistical Toolbox. I’m wondering, how good is it compared to R?
- Here’s an interesting reddit thread on SAS/Stata vs R.
- SPSS and Stata in the same category: they seem to have a similar role so we threw them together. Stata is a lot cheaper than SPSS, people usually seem to like it, and it seems popular for introductory courses. I personally haven’t used either…
- SPSS and Stata for “Science”: we’ve seen biologists and social scientists use lots of Stata and SPSS. My impression is they get used by people who want the easiest way possible to do the sort of standard statistical analyses that are very orthodox in many academic disciplines. (ANOVA, multiple regressions, t- and chi-squared significance tests, etc.) Certain types of scientists, like physicists, computer scientists, and statisticians, often do weirder stuff that doesn’t fit into these traditional methods.
- Another important thing about SAS, from my perspective at least, is that it’s used mostly by an older crowd. I know dozens of people under 30 doing statistical stuff and only one knows SAS. At that R meetup last week, Jim Porzak asked the audience if there were any recent grad students who had learned R in school. Many hands went up. Then he asked if SAS was even offered as an option. All hands went down. There were boatloads of SAS representatives at that conference and they sure didn’t seem to be on the leading edge.
- But: is there ANY package besides SAS that can do analysis for datasets that don’t fit into memory? That is, ones that mostly have to stay on disk? And exactly how good are SAS’s capabilities here anyway? (For one base-R workaround, see the chunked-reading sketch after this list.)
- If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work. There are a few multi-machine data processing frameworks that are somewhat standard (e.g. Hadoop, MPI), but it’s an open question what the standard distributed data analysis framework will be. (Hive? Pig? Or quite possibly something else.)
- (This was an interesting point at the R meetup. Porzak was talking about how going to MySQL gets around R’s in-memory limitations. But Itamar Rosenn and Bo Cowgill (Facebook and Google respectively) were talking about multi-machine datasets that require cluster computation that R doesn’t come close to touching, at least right now. It’s just a whole different ballgame with that large a dataset.)
- SAS people complain about poor graphing capabilities.
- R vs. Matlab visualization support is controversial. One view I’ve heard is, R’s visualizations are great for exploratory analysis, but you want something else for very high-quality graphs. Matlab’s interactive plots are super nice though. Matplotlib follows the Matlab model, which is fine, but is uglier than either IMO.
- Excel has a far, far larger user base than any of these other options. That’s important to know. I think it’s underrated by computer-scientist sorts of people. But it does massively break down at >10k or certainly >100k rows.
- Another option: Fortran and C/C++. They are super fast and memory efficient, but tricky and error-prone to code; you have to spend lots of time mucking around with I/O, and they have zero visualization and data management support. Most of the packages listed above run Fortran numeric libraries for the heavy lifting.
- Another option: Mathematica. I get the impression it’s more for theoretical math, not data analysis. Can anyone prove me wrong?
- Another option: the pre-baked data mining packages. The open-source ones I know of are Weka and Orange. I hear there are zillions of commercial ones too. Jerome Friedman, a big statistical learning guy, has an interesting complaint that they should focus more on traditional things like significance tests and experimental design. (Here; the article that inspired this rant.)
- I think knowing where the typical users come from is very informative for what you can expect to see in the software’s capabilities and user community. I’d love more information on this for all these options.
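On the out-of-memory question above: base R can at least stream a file through a connection in chunks, updating sufficient statistics as it goes. A minimal sketch (the file is hypothetical, and note the caveat about an exact-multiple row count), nowhere near SAS’s generality:

    con <- file("huge.csv", open = "r")
    invisible(readLines(con, n = 1))    # skip the header line
    n <- 0; s <- 0
    repeat {
      chunk <- read.csv(con, header = FALSE, nrows = 1e5)
      n <- n + nrow(chunk)
      s <- s + sum(chunk[[1]])          # running sum of column 1
      if (nrow(chunk) < 1e5) break      # errors instead if rows are an exact multiple of 1e5
    }
    close(con)
    s / n                               # mean of column 1, never all in memory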
114 comments to “Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata”
-
Eric Sun wrote (23. February 2009 at 8:53 pm): >> I know dozens of people under 30 doing statistical stuff and only one knows SAS.
I’m assuming the “one” is me, so I’ll just say a few points:
I’m taking John Chambers’s R class at Stanford this quarter, so I’m slowly and steadily becoming an R convert.
That said, I don’t think anything besides SAS can do well with datasets that don’t fit in memory. We used SAS in litigation consulting because we frequently had datasets in the 1-20 GB range (i.e. can fit easily on one hard disk but difficult to work with in R/Stata where you have to load it all in at once) and almost never larger than 20GB. In this relatively narrow context, it makes a lot of sense to use SAS: it’s very efficient and easy to get summary statistics, look at a few observations here and there, and do lots of different kinds of analyses. I recall a Cournot Equilibrium-finding simulation that we wrote using the SAS macro language, which would be quite difficult in R, I think. I don’t have quantitative stats on SAS’s capabilities, but I would certainly not think twice about importing a 20 GB file into SAS and working with it in the same way as I would a 20 MB file.
That said, if you have really huge internet-scale data that won’t fit on one hard drive, then SAS won’t be too useful either. I’ll be very interested if this R + Hadoop system ever becomes mature: http://www.stat.purdue.edu/~sguha/rhipe/
In my work at Facebook, Python + RPy2 is a good solution for large datasets that don’t need to be loaded into memory all at once (for example, analyzing one Facebook network at a time). If you have multiple machines, these computations can be sped up using IPython’s parallel computing facilities.
Also, R’s graphical capabilities continue to surprise me; you can actually do a lot of advanced stuff. I don’t do much graphics, but perhaps check out “R Graphics” by Murrell or Deepayan Sarkar’s book on Lattice Graphics.
-
Eric Sun wrote (23. February 2009 at 8:55 pm): I thought that most people consider SAS to have the highest learning curve, certainly higher than R. But maybe I’m mistaken about that.
-
Justin wrote (23. February 2009 at 10:24 pm): Calling scipy immature sounds somehow “wrong”. The issues you bring up are more like early design flaws that will not go away, no matter how “mature” scipy gets.
That said, these are flaws, but they seem pretty minor to me.
-
I’ve recently seen GNU DAP mentioned as an open-source equivalent to SAS.
Know if it’s any good?
-
TS Waterman wrote (23. February 2009 at 10:49 pm): Have you considered Octave in this regard? It’s a GNU-licensed Matlab clone. Very nice graphing capability, Matlab syntax and library functions, open source.
http://www.gnu.org/software/octave/FAQ.html#MATLAB-compatibility
-
@Eric - oops, yeah, should’ve put SAS as hardest. Good point that the standard for judging how good large-dataset support is, is whether you can manipulate a big dataset the same way you manipulate a small dataset. I’ve loaded 1-2 GB of data into R and you definitely have to do things differently (e.g. never use by()).
@Justin - scipy certainly seems like it keeps improving. I just keep comparing it to matlab and it’s constantly behind. I remember once watching someone try to make a 3d plot. He spent quite a while going through various half-baked python solutions that didn’t work. Then he booted up matlab and had one in less than a minute. Matlab’s functionality is well-designed, well-put-together and well-documented.
@Edward - I have seen it mentioned too. From glancing at its home page, it seems like a pretty small-time project.
-
@TS - yeah, i used octave just once for something simple. it worked fine. my issues were: first, i’m not impressed with gnuplot graphing. second, the interactive environment isn’t too great. third, trying to clone the matlab language seems crazy since it’s kind of crappy. i think i’d usually pick scipy over octave if being free is a requirement, else go with matlab if i have access to it.
otoh it looks like it supports some nice things like sparse matrices that i’ve had a hard time with lately in R and scipy. i guess worth another look at some point…
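For what it’s worth, the Matrix package that ships with R does provide sparse classes; a minimal sketch:

    library(Matrix)
    m <- sparseMatrix(i = c(1, 4, 8), j = c(2, 9, 10), x = c(7, 1, 3),
                      dims = c(10, 10))  # triplet input, stored as dgCMatrix
    object.size(m)                       # far smaller than the dense equivalent
    m %*% rnorm(10)                      # usual linear algebra still works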
-
Brendan,
Nice overview, I think another dimension you don’t mention — but which Bo Cowgill alluded to at our R panel talk — is performance. Matlab is typically stronger in this vein, but R has made significant progress with more recent versions. Some benchmark results can be found at:
http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php
MD
-
Mike wrote (23. February 2009 at 11:27 pm): In high energy particle physics, ROOT is the package of choice. It’s distributed by CERN, but it’s open source, and is multi-platform (though the Linux flavor is best supported). It does solve some of the problems you mentioned, like running over large datasets that can’t be entirely memory-resident. The syntax is C++ based, and has both an interpreter and the ability to compile/execute scripts from the command line.
There are lots of reasons to prefer other packages (like R) over ROOT for certain tasks, but in the end there’s little that can be done with other packages that one cannot do with ROOT.
-
This is obviously oversimplified - but that is the point of a succinct comparison. I would add that you are missing a lot of disadvantages for Excel - it has incomplete statistics support and an outdated “language” :)
Python actually really shines above the others for handling large datasets using memmap files or a distributed computing approach. R obviously has a stronger statistics user base and more complete libraries in that area - along with better “out-of-the-box” visualizations. Also, some of the benefits overlap - using numpy/scipy you get that same elegant matrix support / syntax that matlab has, basically slicing arrays and wrapping lapack.
The advantages of having a real programming language and all the additional non-statistical libraries & frameworks available to you make Python the language of choice for me. If there is something scipy is weak at that I need, I’ll also use R in a pinch or move down to C. I think you are basically operating at a disadvantage if you are using the other packages at this point. The only other reason I can see to use them is if you have no choice, for example if you inherited a ton of legacy code within your organization.
-
I’m sure you’ve stirred up a lot of controversy. Thanks for calling ‘em like you see ‘em.
As for Mathematica, I haven’t used it for statistics beyond some basic support for common distributions. But one thing it does well is very consistent syntax. I used it when it first came out, then didn’t use if for years, and then started using it again. When I came back to it, I was able to pick it up right where I left off. I can’t put R down for a week and remember the syntax. Mathematica may not do everything, but what it does do, it does elegantly.
-
it would be awesome to have an informal, hands-on tutorial comparison of several of these languages (looking at ease, performance, features, etc.). maybe a meetup at something like super happy dev house, or even something separate. just a thought!
-
@Michael Driscoll - good point! I was afraid to make performance claims since I’ve heard that Matlab is getting faster, they have a JIT or a nice compiler or something now, and I haven’t used it too much recently. (That benchmark page doesn’t even say which matlab version was used, though I emailed the guy…) I’m also suspicious of performance comparisons since I’d expect much of it to be very dependent on the matrix library, and there are several LAPACKs out there (ATLAS and others) and many compile-time parameters to fiddle with. I think I read something claiming many binary builds of R don’t use the best LAPACK they could. I’m not totally sure of this though. But if it’s true that Matlab knows how to vectorize for-loops, that’s really impressive.
@Mike - ah yes, i remember looking at ROOT a long time ago and thinking it was impressive. But then I forgot about it because all the cs/stats people whose stuff I usually read don’t know about it. I think it just goes to show that the data analysis tools problem is tackled so differently by different groups of people, it’s very easy to miss out on better options just due to lack of information!
@Pete - yeah, I whine about python, but I seem to use numpy plenty still :) actually its freeness is a huge win over matlab for cluster environments since you don’t have to pay for a zillion licenses…
Hm I seem to be talking myself into thinking it’s down to R vs Python vs Matlab. then the rosetta stone http://mathesaurus.sourceforge.net/matlab-python-xref.pdf should be my guide…
@John - very interesting. I think many R users have had the experience of quickly forgetting how to do basic things.
-
From David Knowles, who did the comparison Mike Driscoll linked to (http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php):
> Nice comparison. I would add to the pros of R/Python that the data
> structures are much richer than Matlab. The big pro of Matlab still
> seems to be performance (and maybe the GUI for some people). On top of
> being expensive Matlab is a nightmare if you want to run a program on
> lots of nodes because you need a license for every node!
>
> It’s 2008b I did the comparison with - I should mention that!
-
From Rob Slaza’s statistics toolbox tutorials, it *seems* like using MATLAB for stats is reasonably simple…
-
> On top of being expensive Matlab is a nightmare if you want to run a program on lots of nodes because you need a license for every node!
@Brendan:
Re David Knowles’ comment…
There are specialized parallel/distributed computing tools available from MathWorks for writing large-scale applications (for clusters, grid etc.). You should check out: http://www.mathworks.com/products/parallel-computing.
Running full-fledged desktop MATLAB on a huge number of nodes is messy and of course very expensive not to mention that a single user would take away several licenses for which other users will have to wait.
Disclosure: I work for the parallel computing team at The MathWorks
-
Another guy from MathWorks, their head of Matlab product management Scott Hirsch, contacted me about the language issue and was very kind and clarifying. The most interesting bits below.
On Tue, Feb 24, 2009 at 7:20 AM, Scott Hirsch wrote:
>> Brendan –
>>
>> Thanks for the interesting discussion you got rolling on several popular
>> data analysis packages
[...]
>> I’m always very interested to hear the perspectives of MATLAB users, and
>> appreciate your comments about what you like and what you don’t like. I was
>> interested in following up on this comment:
>>
>> “Matlab’s language is certainly weak. It sometimes doesn’t seem to be
>> much more than a scripting language wrapping the matrix libraries. “
>>
>> I have my own assumptions about what you might mean, but I’d be very
>> interested in hearing your perspectives here. I would greatly appreciate it
>> if you could share your thoughts on this subject.
>
> sure. most of my experiences are with matlab 6. just briefly,
>
> * leave out semicolon => print the expression. that is insane.
> * each function has to be defined in its own file
> * no optional arguments
> * no named arguments
> * no way to group variables together in a structure. (i don’t need object
> orientation, just a bunch of named items)
> * no perl/python-style hashes
> * no object orientation (or just a message dispatch system) … less
> important
> * poor/no support for text
> * or other things a general purpose language knows how to do (sql, networks,
> etc etc)
On Tue, Feb 24, 2009 at 11:27 AM, Scott Hirsch wrote:
> Thanks, Brendan. This is very helpful. Some of the things have been
> addressed, but not all. Here are some quick notes on where we are today.
> Just to be clear – I have no intention (or interest) in changing your
> perspectives, just figured I could let you know in case you were curious.
>
>
>
> > * leave out semicolon => print the expression. that is insane.
> No plans to change this. Our solution is a bit indirect, but doesn’t break
> the behavior that lots of users have come to expect. We have a code
> analysis tool (M-Lint) that will point out missing semi-colons, either while
> you are editing a file, or in a batch process for all files in a directory.
>
> > * each function has to be defined in its own file
> You can include multiple functions in a file, but it introduces unique
> semantics – primarily that the scope of these functions is limited to within
> the file.
[[ addendum from me: yeah, exactly. if you want to make functions that are shared in different pieces of your code, you usually have to do 1 function per file. ]]
> > * no optional arguments
> Nothing yet.
>
> > * no named arguments
> Nope.
>
> > * no way to group variables together in a structure. (i don’t need object
> orientation, just a bunch of named items)
> We’ve had structures since MATLAB 5.
[[ addendum from me: well, structures aren't very conventional in standard matlab style, or at least certainly not the standard library. most algorithm functions return a tuple of variables, instead of packaging things together into a structure. ]]
> > * no perl/python-style hashes
> We just added a Map container last year.
>
> > * no object orientation (or just a message dispatch system) … less
> important
> We had very weak OO capabilities in MATLAB 6, but introduced a modern system
> in R2008a.
>
> > * poor/no support for text
> This has gotten a bit better, primarily through the introduction of regular
> expressions, but can still be awkward.
>
> > * or other things a general purpose language knows how to do (sql, networks,
> etc etc)
> Not much here, other than a smattering (Database Toolbox for SQL,
> miscellaneous commands for web interaction, WSDL, …)
>
> Thanks again. I really do appreciate getting your perspective. It’s
> helpful for me to understand how MATLAB is perceived.
>
> -scott
-
@Gaurav - it sure would be nice if i could see how much this parallel toolbox
costs without having to register for a login!
-
There is another good numpy/matlab comparison here:
http://www.scipy.org/NumPy_for_Matlab_Users
As of the last year, a standard ipython install ( “easy_install IPython[kernel]” ) now includes parallel computing right out of the box, no licenses required:
http://ipython.scipy.org/doc/rel-0.9.1/html/parallel/index.html
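[[ addendum from me: a minimal sketch of that 0.9-era API, going from the linked docs; treat the exact calls and the engine-startup step as assumptions, since the parallel API was changing quickly at the time: ]]

    # start some engines first, e.g. with the bundled ipcluster script
    from IPython.kernel import client

    mec = client.MultiEngineClient()   # connect to the running engines
    mec.execute('import numpy')        # run a statement on every engine
    mec.scatter('a', range(16))        # split a sequence across engines
    mec.execute('s = sum(a)')          # each engine reduces its own chunk
    print(sum(mec.pull('s')))          # pull and combine the partial sums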
If this is going to turn into a performance shootout, then I’ll add that from what I’ve seen Python with numpy/scipy outperforms Matlab for vectorized code.
My impression has been that performance order is Numpy > Matlab > R, but as my friend Mike Salib used to say - “All benchmarks are lies”. Anyway, competition is good and discussions like this keep everyone thinking about how to improve their platforms.
Also, keep in mind that performance is often a sticking point for people when it need not be. One of the things I’ve found with dynamically typed languages is that ease of use often trumps raw performance - and you can always move the intensive stuff down to a lower level.
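[[ addendum from me: to make the "vectorize, then drop to a lower level" point concrete, a toy micro-benchmark in Python; absolute numbers will vary by machine, and this is no substitute for the serious comparisons linked below: ]]

    import timeit

    setup = "import numpy; x = numpy.arange(1000000.0)"
    loop = "s = 0.0\nfor v in x: s += v*v"   # interpreted Python loop
    vec = "s = numpy.dot(x, x)"              # one vectorized BLAS call

    print(timeit.timeit(loop, setup=setup, number=10))
    print(timeit.timeit(vec, setup=setup, number=10))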
For people who like poking at numbers:
http://www.scipy.org/PerformancePython
http://www.mail-archive.com/numpy-discussion@scipy.org/msg14685.html
http://www.mail-archive.com/numpy-discussion@scipy.org/msg01282.html
Sturla has some strong points here:
http://www.mail-archive.com/numpy-discussion@scipy.org/msg14697.html
-
thrope wrote:@brendano - I think it might be a case of “if you have to ask you can’t afford it” :)
25. February 2009 at 11:44 am :
-
devicerandom wrote:What about Origin (and Linux/Unix open source clones like Qtiplot)? I know a lot of people using them, and they allow fast, easy statistical analysis with beautiful graphs out of the box. Qtiplot is quite immature but it is Python-scriptable, which is a definite plus for me - I don’t know about Origin.
25. February 2009 at 11:48 am :
-
Hi. I think this is a very incomplete comparison. If you want to make a real comparison, it should be more complete than this wiki article. And to give a bit of personal feedback:
I know 2 people using STATA (social science), 2 people using Excel (philosophy and economics), several using LabView (engineers), some using R (statistical science, astronomy), several using S-Lang (astronomy), several using Python (astronomy) and by using Python, I mean that they are using the packages they need, which might be numpy, scipy, matplotlib, mayavi2, pymc, kapteyn, pyfits, pytables and many more. And this is the main advantage of using a real language for data analysis: you can choose among the many solutions the one that fits you best. I also know several people who use IDL and ROOT (astronomy and physics).
I have used IDL, ROOT, PDL, (Excel if you really want to count that in) and Python and I like Python best :-)
@brendano: One other note: I think that you really have to distinguish between data analysis and data visualization. In astronomy this is often handled by completely different software. The key here is to support standardized file storage/exchange formats. In your example the people used scipy, which does not offer a single visualization routine, so you cannot blame scipy for difficulties with 3D plots…
-
I am a core scipy/numpy developer, and I don’t think calling them immature from a user POV is totally unfair. Every time someone tries numpy/scipy/matplotlib and cannot plot something simple in a couple of minutes, that is a failure on our side. I can only say that we are improving - projects like pythonxy or enthought are really helpful too for people who want something more integrated.
There is no denying that if you are into an integrated solution, numpy/scipy is not the best solution of the ones mentioned today - it may well be the worst (I don’t know them all, but I am very familiar with matlab, and somewhat familiar with R). There is a fundamental problem for all those integrated solutions: once you hit their limitations, you can’t go beyond them. Not being able to handle data which do not fit in memory in matlab, that’s a pretty fundamental issue, for example. Not having basic data structures (hashmap, tree, etc…) is another one. Making advanced UIs in matlab is not easy either.
You can build your own solution with the python stack: the numpy array capabilities are far beyond matlab’s, for example (broadcasting and advanced indexing are much more powerful than matlab’s current capabilities). The C API is complete, and you can do things which are simply not possible with matlab. You want to handle very big datasets? pytables gives you a database-like API on top of hdf5. Things like cython are also very powerful for people who need speed. I believe those are partially consequences of not being integrated.
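[[ addendum from me: two of the numpy features named here, in a self-contained snippet for readers who haven't seen them; this is plain numpy, nothing hypothetical: ]]

    import numpy as np

    x = np.arange(12).reshape(3, 4)
    centered = x - x.mean(axis=0)   # broadcasting: (3,4) minus (4,) just works

    idx = np.array([2, 0])
    print(x[idx])                   # advanced indexing: rows 2 and 0, in that order
    print(x[x > 5])                 # boolean indexing: all entries greater than 5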
Concerning the flaws you mentioned (scipy.linalg vs numpy.linalg, etc…): those are mostly legacies, or exist because removing them would be too costly. There are some efforts to remove redundancy, but not all of them will disappear. They are confusing for a newcomer (they were for me), but they are pretty minor IMHO, compared to other problems.
-
bill wrote:You forgot support and continuity. In my experience, SAS offers very good support and continuity. Others claim SPSS does, too (I have no experience there). In a commercial environment, the programs need to outlive the analyst and the whims of the academic/grad student support/development. For one-off disposable projects, R has lots of advantages. For commercial systems, not so many.
25. February 2009 at 2:29 pm :
-
Lou Pecora wrote:I’ve looked at several of the “packages” mentioned here (R, Octave, MATLAB, C, C++, Fortran, Mathematica). I’m a physicist who is often working in new fields where understanding the phenomena is the main goal. This means my colleagues and I are often developing new numerical/theoretical/data-analysis approaches. For anyone in this situation I unequivocally recommend:
25. February 2009 at 4:45 pm :
Python.
Why? Because given my situation there often are no canned routines. That means sooner or later (usually sooner) I will be programming. Of all the languages and packages I’ve used, Python has no equal. It is object oriented, has very forgiving run-time behavior, fast turn around (no edit, compile, debug cycles — just edit and run cycles), great built in structures, good modularity, and very good libraries. And, it’s easy to learn. I want to spend my time getting results, not programming, but I have to go through code development since often nothing like what I want to do exists and I’ve got to link the numerics to I/O and maybe some interactive things that make it easy to use and run smoothly. I’ve taken on projects that I would not want to attempt in any of the packages/languages I’ve listed.
I agree that Python is not wart-free. The version compatibility can sometimes be frustrating. “One-stop shopping” for a complete Python package is not here, yet (although Enthought is making good progress). It will never be as fast as MATLAB for certain things (JIT compiling, etc. makes MATLAB faster at times). Python plotting is certainly not up to Mathematica standards (although it is good).
However, the Python community is very nice and very responsive. Python now has several easy ways to add extensions written in C or C++ for faster numerics. And for all my desire not to spend time coding, I must admit I find Python programming fun to do. I cannot say that for anything else I’ve used.
-
There is good reason for the duplication of “linalg” in SciPy. SciPy’s version has more features, which probably aren’t of as much use to as wide an audience, and (perhaps more importantly) one of the requirements for NumPy is that it not depend critically on a Fortran compiler. SciPy relaxes this requirement, and thus can leverage a lot of existing Fortran code. At least that’s my understanding.
-
These packages change and it’s easy to get locked-in ideas from the past. I
haven’t used Matlab since the 1990s, but the last time I used it, its I/O and
singular value decomposition was so slow that we switched to S-Plus just to
finish in our lifetimes.
Can any of these packages compute sparse SVDs like folks have used for Netflix (500K x 25K matrix with 100M partial entries)? Or do regressions with millions of items and hundreds of thousands of coefficients? I typically wind up writing my own code to do this kind of thing in LingPipe, as do lots of other folks (e.g. Langford et al.’s Vowpal Wabbit, Bottou et al.’s SGD, Madigan et al.’s BMR).
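[[ addendum from me: on the sparse SVD question, the SciPy stack can at least express this; a sketch assuming a SciPy recent enough to ship scipy.sparse.linalg.svds, shown on a toy matrix rather than Netflix-scale data: ]]

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    # tiny sparse matrix standing in for a huge, mostly-empty ratings matrix
    rows = np.array([0, 1, 2, 3, 4])
    cols = np.array([1, 0, 3, 2, 4])
    vals = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
    A = sp.csr_matrix((vals, (rows, cols)), shape=(5, 5))

    u, s, vt = svds(A, k=2)  # top-2 singular triplets, never densifying A
    print(s)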
What’s killing me now is scaling Gibbs samplers. BUGS is even worse than R in terms of scaling, but I can write my own custom samplers that fly in some cases and easily scale. I think we’ll see more packages like Daume’s HBC for this kind of thing.
R itself tends to just wrap the real computing in layers of scripts to massage data and do error checking. The real code is often Fortran, but more typically C. That must be the same for SciPy given how relatively inefficient Python is at numerical computing. It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.
A problem I see with the way R and BUGS work is that they typically try to compile a declarative model (e.g. a regression equation in R’s glm package or a model specification in BUGS), rather than giving you control over the basic functionality (optimization or sampling).
The other thing to consider with these things from a commercial perspective is licensing. R may be open source, but its GNU license means we can’t really deploy any commercial software on top of it. SciPy has a mixed bag of licenses that is also not redistribution friendly. I don’t know what licensing/redistribution looks like for the other packages.
@bill Support and continuity (by which I assume you mean stability of interfaces and functionality) is great in the core R and BUGS. The problem’s in all the user-contributed packages. Even there, the big ones like lmer are quite stable.
-
As for the rather large speed gains made by recent MATLAB releases that Lou
noted, I believe this is due in most part to their switch to the Intel
Math Kernel Library in place of a well-tuned ATLAS (I’m not completely sure
if that’s what they used before, but it’s a good bet). This hung a good number
of people with PowerPC G5’s out to dry rather quickly as newer MATLABs
apparently only run on Intel Macs (probably so they don’t have to maintain two
separate BLAS backends).
Accelerated linear algebra routines written by people who know the processors inside and out will result in big wins, obviously. You can also license the MKL separately and use it to compile NumPy (if I recall correctly, David Cournapeau, who commented above, was largely responsible for this capability, so bravo!). I figure it’s only a matter of time before somebody like Enthought latches onto the idea of selling a Python environment with MKL baked in, so you can get the speedups without the hassle.
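[[ addendum from me: you can check which BLAS your own NumPy build linked against, and eyeball the effect, with a couple of lines; nothing here is hypothetical: ]]

    import time
    import numpy as np

    np.show_config()  # prints the build's BLAS/LAPACK sections, e.g. an mkl section

    n = 1000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    t = time.time()
    np.dot(a, b)      # dense matrix multiply, dispatched to whatever BLAS is linked
    print(time.time() - t, "seconds")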
-
@ben The SciPy team was also unhappy about the licensing issue, so you’ll be
glad to hear that SciPy 0.7 was released under a single, BSD license.
You said “It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.” We are currently working on ways to expose the mathematical functions underlying NumPy to C, so that you can access them in your extension code. During the last Google Summer of Code, the Cython team implemented a friendly interface between Cython and NumPy. This means that you can code your algorithms in Python, but still have the speed benefits of C.
A number of posts above refer to plotting in 3D. I can recommend Enthought’s Mayavi2, which makes interactive data visualisation a pleasure:
http://code.enthought.com/projects/mayavi/
We are always glad for suggestions on how to improve SciPy, so if you do try it out, please join the mailing list and tell us more about your experience.
-
You should probably add GenStat to your list; this is a UK package specialising in the biosciences. It’s a relative heavyweight in stats, having come from Rothamsted Research (home of Fisher, Yates and Nelder). Nelder was the actual originator of GenStat. GenStat is also free for teaching world-wide and free for research to the developing world. Its popularity is mainly within Europe, Africa and Oceania, hence many US researchers may not have heard of it. I hope this helps
-
Wow, this is the funnest language flamewar I’ve seen.
I will note that no one defended SAS. Maybe those people don’t read blogs.
-
bill wrote:brendano,
27. February 2009 at 3:26 am :
Hmm, I thought I did. I do production work in SAS and mess around (test new stuff, experimental analyses) in R.
Bill
-
Oops. Yes yes. My bad!
OK: no one has defended Stata!
-
John Dudley wrote:My company has been using StatSoft’s Statistica for years and it does all of the things that you found to be shortcomings of SAS, SPSS and Matlab…
4. March 2009 at 2:46 pm :
It’s fast, the graphs are great, and there are virtually no limitations. I’m surprised it wasn’t listed as one of the packages reviewed. We have been using it for years and it is absolutely critical to our business model.
-
StatSoft is the only major package with R integration…The best of both
worlds.
-
Abhijit wrote:In stats there seems to be the S-Plus/R schools and the SAS schools. SAS people find R obtuse with poor documentation, and the R people say the same about SAS (myself included). R wins in graphics and flexibility and customizability (though I certainly won’t argue with a SAS pro who can whip up macros). SAS seems a bit better with large data sets. R is ever expanding, and has improved greatly for simulations/looping and memory management. Recently for large datasets (bioinformatic, not the 5-10G financial ones), I’ve used a combination of Python and R to great effect, and am very pleased with the workflow. I think rpy2 is a great addition to Python and works quite well. For some graphs I actually prefer matplotlib to R.
5. March 2009 at 3:38 am :
I’m also a big fan of Stata for more introductory level stuff as well as for epidemiology-related stuff. It is developing a programming language that seems useful. One real disadvantage in my book is its ability to hold only one dataset at a time, as well as a limit on the data size.
I’ve also used Matlab for a few years. Its statistics toolbox is quite good, and Matlab is pretty fast and has great graphics. It’s limited in terms of regression modeling to some degree, as well as survival methods. Syntactically I find R more intuitive for modeling (though that is the lineage I grew up with). The other major disadvantage of Matlab is distribution of programs, since Matlab is expensive. The same complaint for SAS, as well :)
-
[...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata « X [...]
-
I’ll sing the same song here as I do elsewhere on this topic.
In large-scale production, SAS is second to none. Of course, large-scale production shops usually have the $$$ to fork over, and SAS’s workflow capabilities (and, to a lesser extent, large dataset handling capabilities) save enough billable hours to justify the cost. However, for graphics, exploratory data analysis, and analysis beyond the well-established routines, you have to venture into the world of SAS/IML, which is a rather painful place to be. Its PRNGs are also stuck in the last century: top of the line of a class obsolete for anything other than teaching.
R is great for simulation, exploratory data analysis, and graphics. (I disagree with the assertion that R can’t do high-quality graphics, and, like some commenters above, recommend Paul Murrell’s book on the topic.) Its language, while arcane, is powerful enough to write outside-the-box analyses. For example, I was able to quickly write, debug, and validate an unconventional ROC analysis based on a paper I read. As another example, bootstrapping analyses are much easier in R than SAS.
In short, I keep both SAS and R around, and use both frequently.
I can’t comment too much on Python. MATLAB (or Octave or Scilab) is great for roll-your-own statistical analyses as well, though I can’t see using it for, e.g., a conventional linear models analysis unless I wanted the experience. R’s matrix capabilities are enough for me at this point. I used Mathematica some time ago for some chaos theory and Fourier/wavelet analysis of images and it performed perfectly well. If I could afford to shell out the money for a non-educational license, I would just to have it around for the tasks it does really well, like symbolic manipulation.
I used SPSS a long time ago, and have no interest in trying it again.
-
SPSS has for several years been offering smooth integration with both Python and R. There are extensive APIs for both. Check out the possibilities at http://www.spss.com/devcentral. See also my blog at insideout.spss.com.
You can even easily build SPSS Statistics dialog boxes and syntax for R and Python programs. DevCentral has a collection of tools to facilitate this.
This integration is free with SPSS Base.
-
[...] comparing statistical software (R, SAS, SPSS, MATLAB and Stata). [...]
-
Sean wrote:I used Matlab, R, stata, spss and SAS over the years.
11. March 2009 at 4:18 am :
To me, the only reason for using SAS is its large data ability. Otherwise, it is a very, very bad program. It, from day one, trains its users to be third-rate programmers.
The learning curve for SAS is actually very steep, particularly for a very logical person. Why? The whole syntax in SAS is pretty illogical and inconsistent.
Sometimes it is ‘/out’; sometimes it is ‘output’.
In 9.2, SAS started to make variables inside a macro local by default.
This is ridiculous!! The SAS company has existed for at least 30 years. How can this basic programming rule only now be implemented, after 30 years?!
Also, if a variable is uninitialized, SAS will still let the code run. One time, at a company I worked for, this simple, stupid SAS design flaw cost our project 3 weeks of delay (there was one uninitialized variable among 80k lines of log, all blue). A couple of PhDs on the project who used C and Matlab could not believe SAS would make such a stupid mistake. Yes, to their great disbelief, it did!
My ranking is that Matlab and R are about the same; Matlab is better at plots most times, R is better at manipulating datasets. Stata and SAS are at the same level.
After taking cost into account, the answer is even more obvious.
-
bill r wrote:SAS was not designed by a language maven, the way Pascal was. It grew from its PL/1 and Fortran roots. It is a collection of working tools, added to meet the demands of working statisticians and IT folk, that has grown since its start in the late ’60s and early ’70s. SAS clearly has cruft that shows its growth over time. Sort of like the UNIX tools, S, and R, actually.
12. March 2009 at 1:37 pm :
And, really, what competent programmer would ever use a variable without initializing or testing it first? That’s a basic programming rule I learned back in the mid ’60s, after branching off of uninitialized registers, and popping empty stacks.
Bah, you kids. Get off of my lawn!
-
tom p wrote:i work for a retail company that deploys SAS for their large datasets and complex analysis. just about everything else is done in excel.
13. March 2009 at 4:57 am :
we had a demo of omniture’s discover onpremise (formerly visual sciences), and the visualization tools are fairly amazing. it seems like an interesting solution for trending real time evolving data, but we aren’t pulling the trigger on it now.
-
For reference PDL (Perl Data Language) can be found at pdl.perl.org/
and is also available via CPAN
/I3az/
-
opps.. link screwed up… here goes again ;-)
pdl.perl.org
-
Have you seen Resolver One?
It’s a spreadsheet like Excel, but has built-in Python support, and allows cells
in the grid to hold objects. This means that numpy mostly works, and you can
have one cell in the grid hold a complete dataset, then manipulate that dataset
in bulk using spreadsheet-like formulae. Someone has also just built an extension that
allows you to connect it to R, too. In theory, this means that you can get
the best of all three — spreadsheet, numpy, and R — in your model, using the
right tool for each job.
On the other hand, the integration with both numpy and R is quite new, so it’s immature as a stats tool compared to the other packages in this list.
Full transparency: I work for Resolver Systems, so obviously I’m biased towards it :-) Still, we’re very keen on feedback, and we’re happy to give out free copies for non-commercial research and for open source projects.
-
Being the resident MATLAB enthusiast in a house built on another tool, I will pitch in my two cents by suggesting another spectrum along which these tools lie: “canned procedures” versus “roll your own”. General-purpose programming languages, such as the Fortran or C/C++ suggested in the comments, clearly anchor one end of this dimension, whereas the statistical software sporting canned routines lies all the way at the other. A tool like MATLAB, which provides some but not complete direct statistical support, is somewhere in the middle. The trade-off here, naturally, is the ability to customize analysis vs. convenience.
-
Jude Ryan wrote:Most of the users on this post are biased towards packages like R, rather than packages like SAS, and I want to offer my perspective of the relative advantages and disadvantages of SAS relative to R.
16. March 2009 at 4:37 pm :
I am primarily a SAS user (over 20 years) who has been using R as needed (a few years) to do things that SAS cannot do (like MARS splines), or cannot do as well (like exploratory data analysis and graphics), or requires expensive SAS products like Enterprise Miner to do (like decision trees, neural networks, etc).
I have worked primarily for financial service (credit cards) companies. SAS is the primary statistical analysis tool in these companies partly due to history (S, the precursor to S+ and R, was not yet developed) and partly because it can run on mainframes (another legacy system) accessing huge amounts of data stored on tapes, which I am not sure any other statistical package can. Furthermore, businesses that have the $ will be the last to embrace open source software like R, as they generally require quick support when they get stuck trying to solve a business problem, and researching the problem in a language like R is generally not an option in a business setting.
Also, SAS’ capabilities for handling large volumes of data are unmatched. I have read huge compressed files of online data (Double Click), having over 2 billion records, using SAS, to filter the data and keep only the records I needed. Each of the resulting SAS datasets was anywhere from 35 GB to 60 GB in size. As far as I know, no other statistical tool can process such large volumes of data programmatically. First we had to be able to read in the data and understand it. Sampling the data for modeling purposes came later. I would run the SAS program overnight, and it would generally take anywhere from 6 to 12 hours to complete, depending on the load on the server. In theory, any statistical software that works with records one at a time should be able to process such large volumes of data, and maybe the Python based tools can do this. I do not know, as I have never used them. But I do know that R, and even tools like WEKA, cannot process such volumes of data. Reading the data from a database, using R, can mitigate the large data problems encountered in R (as does using packages like biglm), but SAS is the clear leader in handling large volumes of data.
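[[ addendum from me: on the "maybe the Python based tools can do this" aside, plain Python does stream records one at a time; a minimal sketch of the filter-a-huge-compressed-file pattern (the file name and the filter condition are made up): ]]

    import csv
    import gzip

    def keep(rec):
        # hypothetical filter: keep one campaign's records
        return rec[2] == "campaign_42"

    src = gzip.open("clicks.csv.gz", "rt")   # decompressed lazily, line by line
    dst = open("kept.csv", "w")
    out = csv.writer(dst)
    for rec in csv.reader(src):              # only one record in memory at a time
        if keep(rec):
            out.writerow(rec)
    src.close()
    dst.close()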
R on the other hand is better suited for academics and research, as cutting edge methodologies can be and are implemented much more rapidly in R than in SAS, as R’s programming language has more elegant support for vectors and matrices than SAS (proc IML). R’s programming language is much more elegant and logically consistent, while SAS’ programming language(s) are more ad hoc, with non-standard programming constructs. Furthermore, people who prefer R generally have a stronger “theoretical” programming background (most have programmed in C, Perl, or object-oriented languages) or are able to pick up programming faster, while most users who feel comfortable with SAS have less of a programming background and can tolerate many of SAS’ non-standard programming constructs and inconsistencies. These people do not require or need a comprehensive programming language to accomplish their tasks, and it takes much less effort to program in base SAS than in R if one has no “theoretical” programming background. SAS macros take more time to learn, and many programming languages have no equivalent (one exception I know of is C’s preprocessor commands). But languages like R do not need anything like SAS macros and can achieve the same results all in one logically consistent programming language, and do more, like enabling R users to write their own functions. The equivalent in SAS to writing functions in R is to program a new proc in C and know how to integrate it with SAS. An extremely steep learning curve. SAS is more of a suite of products, many of them with inconsistent programming constructs (base SAS is totally different from SCL - formerly Screen Control Language but now SAS Component Language), and proc SQL and proc IML are different from data step programming.
So while SAS has a shallow learning curve initially (learn only base SAS), the user can only accomplish tasks of “limited” sophistication with SAS without resorting to proc IML (which is quite ugly). For the business world this is generally adequate. R, on the other hand, has a steeper learning curve initially, but tasks of much greater sophistication can be handled more easily in R than in SAS, once R’s steeper learning curve is behind you.
I foresee increased use of R relative to SAS over time, as many statistics departments at universities have started teaching R (sometimes replacing SAS with R) and students graduating from these universities will be more conversant with R, or equally conversant with both SAS and R. Many of these students entering the workforce will gravitate towards R, and to the extent the companies they work for do not mandate which statistical software to use, the use of R is bound to increase over time. With memory becoming cheaper, and Microsoft-based 64 bit operating systems becoming more prevalent, bigger datasets can be stored in RAM, and R’s limitations in handling large volumes of data are starting to matter less. But the amount of data is also starting to grow, thanks to the internet, scanners (used in grocery chains), etc., and the volume of data may very well grow so rapidly that even cheaper RAM and 64 bit operating systems may not be able to cope with the data deluge. But not every organization works with such large datasets.
For someone who has started their career using SAS, SAS is more than adequate to solve all problems faced in the business world, and there may seem to be no real reason, or even justification, to learn packages like R or other statistical tools. To learn R, I have put in much personal time and effort, and I do like R and foresee using it more frequently over time for exploratory data analysis, and in areas where I want to implement cutting edge methodologies and where I am not hampered by large data issues. Personally, both SAS and R will always be part of my “tool kit” and I will leverage the strengths of both. For those who do not currently use R, it would be wise to start doing so, as R is going to be more widely used over time. The number of R users has already reached critical mass, and since R is free, this is bound to increase the usage of R as the R community grows. Furthermore, the R Help Digest, and the incredibly talented R users who support it, is an invaluable aid to anyone interested in learning R.
-
[...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS,
SPSS, Stata - Brendan O’Co… statistics software No comments yet. [...]
-
Interesting. I don’t think I would have put SPSS and Stata in the same
category. I haven’t spend a tremendous amount of time working with SPSS, but I
have spent a fair amount of time with Stata, and my biased perspective is that
Stata is more sophisticated and powerful than SPSS. Certainly, Stata’s language
isn’t as powerful as R’s, but I definitely wouldn’t say it’s “weak.” Stata’s not
my favorite statistical program in the world (that would, of course, be R), but
there are definitely things I like about it; it’s a definite second to R in my
book.
By the way, here’s my (unfair) generalization regarding usage:
– R: academic statisticians
– SAS: statisticians and data-y people in non-academic settings, plus health scientists in academic and non-academic settings
– SPSS: social scientists
– Stata: health scientists
-
Walking Randomly » R Compared to MATLAB (or ‘learning a thing or two from your students’) wrote:[...] matrices. You don’t get much more MATLABy than matrices! Other articles such as this comparison between various data analysis packages also proved interesting and [...]
23. March 2009 at 5:58 pm :
-
xin wrote:Sean:
19. April 2009 at 2:03 am :
I am a junior SAS user with only 3 years’ experience. But even I know that you need to press ‘ctrl’ and ‘F’ to search for ‘uninitialized’ and ‘more than’ in the SAS log to ensure everything is OK.
As far as the couple of C++ PhDs in your group are concerned, they need to learn to play by the rules of whatever system they are using……
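[[ addendum from me: that ctrl-F ritual is easy to automate; a minimal Python sketch that scans a SAS log for the two danger phrases xin names (the log file name is made up): ]]

    danger = ("uninitialized", "more than")

    with open("project.log") as log:
        for lineno, line in enumerate(log, 1):
            lowered = line.lower()
            if any(phrase in lowered for phrase in danger):
                print(lineno, line.rstrip())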
-
xin wrote:By the way, I find the comments the SAS people left are more tolerant and open-minded (maybe they are older, lol). Instead, the majority of ‘R’ers on this thread act like a bunch of rebellious teens…..
19. April 2009 at 2:07 am :
-
Joe wrote:I am a big fan of Stata over SAS for medium and small businesses. SAS is the Mercedes-Benz of stats, I’ll admit, for government and big business. I use Stata a LOT for economics; it has all the most-used predictive methods (OLS, MLE, GLS, 2SLS, binary choice, etc) built in. I think a model would have to be pretty esoteric not to be found in Stata.
30. April 2009 at 6:58 pm :
I ran Stata on a linux server with 16GB ram and about 2TB of disk storage. The hardware config was about $12K. I would not recommend using virtual memory for Stata. That said, you can stick a lot of data in 16GB ram! If I pay attention to the variable sizes (keep textual ones out), I can get 100s of millions of rows into memory.
Stata supports scripting (*do files), which is very easy to use, as is the GUI. The GUI is probably the best feature.
The hardware ($12,000) + software ($3000 - 2 user license) costs $15,000. The equivalent SAS software was about $100,000. You do the math.
I’ve used SPSS, but that was a while ago. At that time I felt Stata was the superior product.
-
Finally a direct Stata vs SAS comparison! Very interesting. Thanks for
posting. I can’t believe SAS = $100,000.
> I ran Stata on linux server with 16GB ram and about 2TB of disk storage.
> I would not recommend using virtual memory for Stata.
In my experience, virtual memory is *always* a bad idea. I remember working with ops guys who would consider a server as good as dead once it started using swap.
All programs that use hard disks effectively have custom code to control when to move data on and off the disk. Disk seeks and reads are just too slow and cumbersome compared to RAM to have the OS try to handle it automatically.
This would be my guess why SAS handles on-disk data so well - they put a lot of engineering work into supporting that feature. Same for SQL databases, data warehouses, and inverted text indexes. (Or the widespread popularity of Memcached among web engineers.) R, Matlab, Stata and the rest were originally written for in-memory data and still work pretty much only in that setting.
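[[ addendum from me: the low-tech version of that custom control, in Python terms, is an explicit chunk loop over a memory-mapped file; a toy sketch where the "big" file is only 80MB: ]]

    import numpy as np

    # create a file-backed array standing in for data too big for RAM
    data = np.memmap("big.dat", dtype="float64", mode="w+", shape=(10000000,))
    data[:] = 1.0

    total = 0.0
    chunk = 1000000
    for start in range(0, data.shape[0], chunk):
        total += data[start:start + chunk].sum()  # we decide what is resident, and when
    print(total)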
-
And also, on the RAM vs hard disk issue — according to Jude Ryan’s very
interesting comment above, SAS has a heritage of working with datasets on *tape*
drives. Tape, of course, is even further along the size-vs-latency spectrum than
RAM or hard disk. Now hard disk sizes are rapidly growing but seek times are not
catching up, so people like to say “hard disk is the new tape” — therefore, if
your software was originally designed for tape, it may do best! :)
-
Here’s an overly detailed comparison of Stata, SAS, and SPSS. Basically no
coverage of R beyond the complaint that it’s too hard.
http://www.ats.ucla.edu/stat/technicalreports/
There’s also an interesting reply from Patrick Burns, defending R and comparing it to those 3.
http://www.ats.ucla.edu/stat/technicalreports/Number1/R_relative_statpack.pdf
(Found linked from a comment on John D. Cook’s blog here:
http://www.johndcook.com/blog/2009/05/01/r-the-good-parts/ )
-
Jaime wrote:I feel so old. Been using SAS for many years. But what the hell is this R ?????? That’s what the kids are using now?
27. May 2009 at 9:37 pm :
-
Gye Greene wrote:Great comparison of SPSS, SAS, and Stata by Acock (a summary of his findings here — http://www.ocair.org/files/KnowledgeBase/willard/StatPgmEvalAb.pdf)
28. May 2009 at 4:54 am :
Below is a summary of the summary — !!! — with my own observations added on.
SAS: Scripting language is awkward, but it’s great for manipulating complex data structures; folks that analyze relational DBs (e.g. govt. folks) tend to use it.
SPSS: Great for the “weekend warriors”; strongly GUI-based; has a scripting language, but it’s inelegant. They charge a license for **each** “module” (e.g. correlations? linear regressions? Poisson regressions? A separate fee!). Also, they charge an annual license. Can read Excel files directly. Used to have nicer graphs and charts than Stata (but, see below).
Stata: Elegant, short-’n'-punchy scripting language; CLI and script-oriented, but also allows GUI. Strong user base, with user-written add-ons available for D/L. **Excellent** tech support! The most recent version (Stata 10) now has some pretty powerful chart/graph editing options (GUI, plus CLI, your choice) that make it competitive with the SPSS graphs. (Minor annoyance: every few versions, they make the data format NOT back-compatible with the previous version — have to remember to “Save As” last-year’s version, or else what you save at work won’t open at home…)
My background: Took a course on SAS, but haven’t had a reason to use it. I’ve used SPSS and Stata both, on a reasonably regular basis: I currently teach “Intro to Methods” courses with SPSS, but use Stata for my own work. I dislike how SPSS handles missing values. Unlike SPSS, Stata sells a one-time license: once you buy a version, it’s yours to keep until you feel it’s too obsolete to use.
–GG
-
Gye Greene wrote:This may be an unfair generalization, but my personal observation is that SPSS users (within the social sciences, at least) tend to have less quantitative training than Stata users. Probably highly correlated with the GUI vs. CLI orientations of the two packages (although each of them allows for both).
28. May 2009 at 1:53 pm :
Another way of differentiating between various statistical software packages is their Geek Cred. I usually tell my Intro to Research Methods (for the social sciences) students that…
(On a scale of 0-10…)
R, Matlab, etc. = 9
SAS = 7
Stata = 5
SPSS = 3
Excel = 2
YMMV. :)
COMMENT ON EXCEL: It’s a spreadsheet, first and foremost — so it doesn’t treat rows (cases) as “locked together”, like statistical software does. Thus, when you highlight a column and ask it to sort, it sorts **only** that column. I got burned by this once, back in my first year of grad school, T.A.-ing: sorted HW #1 scores (out of curiosity), and didn’t notice that the rest of the scores had stayed put. Oops.
I now keep my gradebooks in Stata. :)
–GG
-
Chuck Moore wrote:I began programming in SAS every day at a financial exchange in 1995. SAS has three main benefits over all other Statistical/Data Analysis packages, as far as I know.
29. May 2009 at 1:29 pm :
1) Data size = truly unlimited. I learned to span 6 DASD (Direct Access Storage Devices) = disk drives on the mainframe for when I was processing > 100 million records = quotes and trading activity from all exchanges. When we went to Unix, we used 100 GB worth of temp “WORK” space, and were processing > 1 Billion transactions a day in < 1 hour (IBM p630 with 4x 1.45 GHz processors and 32 GB of memory; only the processing actually used < 4 GB).
2) Tons and tons of preprogrammed statistical functions with just about every option possible.
3) SAS can read data from almost anything: tapes, disk, etc.; fixed field flat files, delimited text files (any delimiters, not just comma or tab or space), xml, most any database, all mainframe data file types. It also translates most any text value into data, and supports custom input and output formats.
SAS is difficult for most real programmers (I took my first programming class in 1977, and have programmed in more languages than I care to share) because it has a data centric perspective as opposed to machine/control centric. It is meant to simplify the processing of large amounts of data for non-programmers.
SAS used to have incredible documentation and support, at incredibly reasonable prices. Unfortunately, the new generation of programmers and product managers have lost their way, and I agree that SAS has been becoming a beast.
For ad hoc work, I immediately fell in love with SAS/EG = Enterprise Guide. Unfortunately, EG is written in .net and is not that well written. I would have preferred it being written in Java so that the interface was more portable and supported a better threading model. Oh well.
One of the better features of SAS is that it is not an interpreted programming language; from the start in 197? it was JIT. Basically, a block of code is read, compiled, and then executed. This is why it is so efficient at processing huge amounts of data. The concept of the “data step” does allow for some built in inefficiencies from the standpoint of multiple passes through the data, but that is because of SAS’s convenience. A C programmer would have done more things in fewer passes, but the C programmer would have spent many more hours writing the program than SAS’s few minutes to do the same thing. I know this because I’ve done it.
Somewhere I read a complaint about SAS holding only one observation in memory at a time. That is a gross misunderstanding/mistake. SAS holds one or more blocks of observations (records) in memory at a time. The number held is easily configurable. Each observation can be randomly accessed, whether in memory or not.
SAS 9.2 finally fixes one of the bigger complaints, with PROC FCMP allowing the creation of custom functions. Originally SAS did not support custom functions; SAS wanted to write them for you.
The most unfortunate thing about SAS currently is that it has such a long legacy on uniprocessor machines that it is having difficulty getting going in the SMP world, being able to properly take advantage of multi-threading and multi-processing. I believe this is due to lack of proper technical vision and leadership. As such, I believe a Java language HPC derivative and tools will eventually take over, providing superior ease of use, visualization, portability, and processing speed on today’s servers and clusters. Since most data will come from an RDBMS these days, flat file input won’t carry enough weight.
But, for my current profession = Capacity Planning for computer systems, you still can’t beat SAS + Excel. On the other hand, it looks like I’m going to have to look into R.
-
Chuck Moore wrote:On a side note. As a “real” programmer, having been an expert in Pascal and C and having programmed in, oh I don’t want to list them all, but I have also done more than just take classes in Java. Anyway, Macros have a place in programming. There have been a few times I wished Java supported macros and not just assertions, out of my own laziness. I am a firm believer in the right tool for the job, and that not everything is a nail, so I need more than just a hammer. The unfortunate thing is that macros can be abused, just like goto’s and programming labels and global variables.
29. May 2009 at 1:47 pm :
To me, SAS is/was the greatest data processing language/system on the planet. But, I still also program in Java, C, ksh, VBScript, Perl, etc. as appropriate. I’d like to see someone do an ARIMA forecast in Excel, or run a regression that does outlier elimination in only 3 lines of code!
-
If your dataset can’t fit on a single hard drive and you need a
cluster, none of the above will work.
One thing you have to consider is that using SciPy, you get all of the python libraries for free. That includes the Apache Hadoop code, if you choose to use that. And as someone above pointed out, there is now parallel processing built right into the most recent distributions (but I have no personal knowledge of that) for MPI or whatever.
Coming from an engineer in industry (not academia), the really neat thing that I like about SciPy is the ease of creating web-based tools (as in, deployed to a web server for others to use) via deployment on an apache installation and mod_python. If you can get other engineers using your analysis without sending them an Excel spreadsheet, or a .m file (for which they need a matlab license), etc., it makes your work much more visible.
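[[ addendum from me: the usual Python entry point to Hadoop is Hadoop Streaming, where the mapper and reducer are ordinary scripts reading stdin and writing tab-separated key/value lines; a minimal mapper sketch (the field positions are made up): ]]

    #!/usr/bin/env python
    # Hadoop Streaming mapper: emit (key, value) pairs, one per line, tab-separated
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2:
            print("%s\t%s" % (fields[0], fields[2]))  # hypothetical key and value columns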
-
sohan wrote:hello everyone…
14. June 2009 at 10:30 am :
I want to know about comparative studies between SAS, R, and SPSS in data analysis.
Can anyone provide me with papers related to those?
-
ed wrote:having used sas, spss, matlab, gauss and r, let me say that describing stata as having a weak programming language is a sign of ignorance.
18. June 2009 at 11:28 am :
it has a very powerful interpreted scripting language which allows one to easily extend stata. there is a very active community and many user written add-ons are available. see: http://ideas.repec.org/s/boc/bocode.html
stata also has a full fledged matrix programming language (called mata), comparable to matlab, with a c-like syntax, which is compiled and therefore very fast.
managing and preparing data for analysis is a breeze in stata.
finally stata is easy to learn.
obviously not many people use stata around here.
some more biased opinions:
sas is handy if you have some old punch cards in the cupboard or a huge dataset. apart from that it truly sucks. some people say that it is good to manage data, but why not use a good relational database to do that and then use decent statistical software to do the analysis?
excel sucks obviously infinitely more than sas. apart from its (lack of) statistical capabilities and reliability, any point-and-click-only software is an obvious no-no from the point of view of scientific reproducibility
i don’t care for spss and cannot imagine anyone does.
matlab is nice, but expensive. not so great for preparing/managing data.
have not used scipy/numpy myself, but have colleagues who love it. one big advantage is that it uses python (ie good language to master and use)
r is great, but more difficult to get into. i don’t like the loose syntax too much though. it is also a bitch with big datasets.
-
Willem wrote:On high quality graphics in R, one should certainly check out the Cairo package. Many graphics can be output in hip formats like SVG.
17. July 2009 at 6:53 am :
-
On the point of Excel breaking down at 10,000+ rows, apparently Excel 2010
will come with Gemini, an add-on developed by the Excel and SQL team, aiming at
handling large datasets:
Project Gemini sneak preview
I doubt this would make Excel the platform of choice for doing anything fancy with large datasets anyways, but I am intrigued.
-
Jay Verkuilen wrote:Some reax, as I’ve used most of these at some point:
26. July 2009 at 9:48 pm :
SAS has great support for large files even on a modest machine. A few years ago I did a bunch of sims for my dissertation using it and it worked happily away without so much as batting an eyelash on a crappy four year old Windoze XP machine with 1.5 GB of memory. Also, programs like NLP (nonlinear optimization), NLMIXED, MIXED, and GLIMMIX are really great for various mixed model applications—this is quite broad, as many common models can be cast in the mixed model framework. NLMIXED in particular lets you write some pretty interesting models that would otherwise require special coding. Documentation in SAS/STAT is really solid and their tech support is great. Graphics suck and I don’t like the various attempts at a GUI.
I prefer Stata for most “everyday” statistical analysis. Don’t knock that, as it’s pretty common even for a methodologist such as myself to need to fit logistic regression or whatever and not want to have to waste a lot of time on it, which Stata is fantastic for. Stata 11 looks to be even better, as it incorporates procedures such as Multiple Imputation easily. The sheer amount of time spent doing MI followed by logistic regression (or whatever) is irritating. Stata speeds that up. Also when you own Stata you own it all and the upgrade pricing is quite reasonable. Tech support is also solid.
SPSS has a few gems in its otherwise incomprehensible mass of utter bilge. IMO it’s a company with highly predatory licensing, too.
R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility. I like it for class because it’s free, there are nice books for it, and it lets me bypass IT as it’s possible to put a working R system on a USB drive. I love the graphics.
Matlab has made real strides as a programming language and has superb numerics in it (or did), at least according to the numerics people I know (including my numerical analysis professor). However, Statistics Toolbox is iffy in terms of what procedures it supports, though it might have been updated. Graphics are also nice. But it is expensive.
Mathematica is nice for symbolic calculation. With the MathStatica addon (sadly this has been delayed for an unconscionable amount of time) it’s possible to do quite sophisticated theoretical computations. It’s not a replacement for your theoretical knowledge, but is very helpful for doing all the inaccurate and tedious calculations necessary.
-
Brett D wrote:I started in Matlab, moved on to R, looked at Octave, and am just getting into SciPy.
27. July 2009 at 10:58 am :
Matlab is good for linear algebra and related multivariate stats. I could never get any nice plotting out of it. It can do plenty of things I never learnt about, but I can’t afford to buy it, so I can’t use it now anyway.
R is powerful, but can be very awkward. It can write jpeg, png, and pdf files, make 3D plots and nice 2D plots as well. Two things put me off it: it’s an absolute dog to debug (how does “duplicate row names are not allowed” help as an entire error message when I’ve got 1000 lines of code spread between 4 functions?), and its data types have weird eccentricities that make programming difficult (like transposing a data frame turns it into a matrix, and using sapply to loop over something returns a data frame of factors… I hate factors). There are a lot of packages that can do some really nice things, although some have pretty thin documentation (that’s open source for you).
Octave is nicer to use than R ( = Matlab is nicer to use than R), but I found it lacking in most things I wanted to do, and the development team seem to wait for something to come out in Matlab before they’ll do it themselves, so they’re always one step behind someone else.
I’m surprised how quickly I’m picking up SciPy. It’s much easier to write, read and debug than R, and the code looks nicer. I haven’t done much plotting yet, but it looks promising. The only trick with Python is its assignments for mutable data types, which I’m still getting my head around.
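[[ addendum from me: the mutable-assignment "trick" in two snippets; assignment binds a new name to the same object rather than copying, and numpy slicing has analogous view semantics: ]]

    a = [1, 2, 3]
    b = a                # b is the very same list, not a copy
    b.append(4)
    print(a)             # [1, 2, 3, 4]

    import numpy as np
    x = np.zeros(5)
    y = x[1:3]           # a view onto x's memory, not a copy
    y[:] = 9.0
    print(x)             # [ 0.  9.  9.  0.  0.]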
-
Mike wrote:Mathematica is also able to link to R via a third party add-on distributed by ScienceOps. The numeric capabilities of Mathematica were “ramped” up 6 years ago so should be thought of as more than a symbolic (only) environment. Further info here:
29. July 2009 at 9:45 pm :
http://reference.wolfram.com/mathematica/note/SomeNotesOnInternalImplementation.html#28959
(I work for Wolfram Research)
-
R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility.
Hah!
Everyone really likes Stata. Interesting.
-
I use Python/Matlab for most analysis, but Mathematica is really nice for
building demos and custom visualization interfaces (and for debugging your
formulas)
For instance, here’s an example of taking some mutual fund data, and visualizing those mutual funds (from 3 different categories) in a Fisher Linear Discriminant transformed space (down to 3 dimensional from initial 57 or so)
http://yaroslavvb.com/upload/strands/dim-reduce/dim-reduce.html
-
A post on R vs. Matlab: To R or
not to R
-
Also, a discussion looking for solutions that are both fast to prototype and
fast to execute: suitable
functional language for scientific/statistical computing
-
Cristian wrote:I do not understand why SAS is so much hailed here because it handles large datasets. I use Matlab almost exclusively in finance, and when I have problems with how large the datasets are, I don’t use SAS but mysql server instead. Matlab can talk to mysql server, and thus I do not see why SAS is needed in this case.
1. September 2009 at 3:21 am :
-
Mike wrote:I have used Stata and R but for my purposes I actually prefer and use Mathematica. Unsurprisingly nobody has discussed its use so I guess I will.
11. September 2009 at 6:38 am :
I work in ecology and I use Mathematica almost exclusively for modeling. I’ve found that the elegance of the programming language lends itself to easily using it for statistical analysis as well. Although it isn’t really a statistics package, being able to generate large amounts of data and then process them in the same place is extremely useful. To make up for the lack of built in statistical analysis I’ve built my own package over time by collecting and refining the tests I’ve used.
For most people I would say using Mathematica for statistics is way more work than it is worth. Nevertheless, those who already use it for other things may find it is more than capable of performing almost any data analysis you can come up with using relatively little code. The addition of functionality targeted at statistics in versions 6 and 7 has made this use simpler, although the built in ANOVA package is still awkward and poorly documented. One thing it and Matlab beat other packages at hands down is list/matrix manipulation which can be extremely useful.
-
Paul Kim wrote:I am using MATLAB along with SPSS. Does anyone know how to connect SPSS with MATLAB? Or can we use any form of programming (e.g., “for” loops and “if”) in SPSS to connect with MATLAB?
14. September 2009 at 9:10 pm :
Thank you.
Paul
-
Mattia wrote:I worked at the International Monetary Fund so I thought I’d add the government perspective, which is pretty much the same as the business one. You need software that solves the following equation
25. September 2009 at 1:39 pm :
maximize amount of useful output
such that: salaries of staff * hours worked - cost of software < budget
It turns out IMF achieves that by letting every economist work with whatever they want. As a matter of fact, economists end up using Stata.
Consider that most economics datasets are smaller than 1Gb. Stata MultiProcessor will work comfortably with up to 4Gb on the available machines. Stata has everything you need for econometrics, including a matrix language that is just like Matlab and state of the art maximum likelihood optimization, so you can create your own “odd” statistical estimators. Programming has a steeper learning curve than Matlab but once you know the language it’s much more powerful, including very nice text data support and I/O (not quite python, but good enough). If you don’t need some of the fancy add-on packages that engineers use, like say “hydrodynamics simulation”, that’s all you need. But most importantly importing, massaging and cleaning data with Stata is so unbelievably efficient that every time I have to use another program I feel like I am walking knee-deep in mud.
So why do I have to use other programs, and which?
IMF has one copy of SAS that we use for big jobs, such as when I had 100Gb of data. I won’t dwell on this because it’s been covered above, but in general SAS is industrial-grade stuff. One big difference between SAS and other programs is that SAS will try to keep working when something goes wrong. If you *need* numbers for the next morning, you go to bed; the next morning you come in and Stata has stopped working because of a mistake. SAS hasn’t, and perhaps your numbers are garbage, but if you are able to tell that they are simply 0.00001% off then you are in perfectly good shape to make a decision.
Occasionally I use Matlab or Gauss (yes, Gauss!) because I need to put the data through some black box written in that language and it would take too long to understand it and rewrite it.
That’s all folks. Thanks for the attention.
-
Mattia wrote:No that was not all, I forgot one thing. Stata can map data using a free user-written add-in (spmap), so you can save yourself the time of learning some brainy GIS package. Does anyone know whether R, SAS, SPSS or other programs can do it?
25. September 2009 at 6:42 pm :
-
R has some packages for plotting geo data, including “maps”, “mapdata”, and
also some ggplot2 routines. Now I just saw an entire “R-GIS” project, so I’m
sure there’s a lot more related stuff for R…
-
Comparison of data analysis packages (R, Matlab, SciPy, Excel, SAS, SPSS, Stata) « دنیای پیرامون wrote:[...] to see which one was more suitable, I started comparing. On a blog I found a simple but fairly deep comparison. It [...]
30. September 2009 at 6:59 am :
-
Tao Wu wrote:Hi, all. I think I should mention a C++ framework-based software package, named ROOT. See http://root.cern.ch
30. September 2009 at 5:56 pm :
You will see ROOT is definitely better than R.
-
Tao Wu wrote:As I see it, the syntax and grammar of R are really stupid. I cannot imagine that R, S, S+ have been widely used by financial bodies. Furthermore, they are trying to claim they are very professional and very good at financial data analysis. I predict that if they shift to ROOT (a real language, with C++), they will see the power of data analysis.
30. September 2009 at 5:59 pm :
-
xin (April 19) writes:
> the majority of ‘R’ers on this thread act like a bunch of rebellious teens …
Well spotted — I’ve been a rebellious teen for decades now.
-
Wei Zhang wrote:People in my workplace, an economic research trust, love STATA. Economists love STATA and they ask newcomers to use STATA as well. R is discouraged in my workplace with excuses like “it is for statisticians”. Sigh~~~~
10. January 2010 at 10:35 am :
But!!! I keep using it and keep discovering new ways of using it. Now, I use the ‘dmsend’ function from the ‘twitteR’ package to inform me of the status of my time-consuming simulations while I am not in the office. It is just awesome that using R makes me feel bounded by nothing.
BTW, does anyone know how to use R to send emails (on various OSes: Win, Mac, Unix, Linux)? I googled a bit and it was not very promising. Any plans to develop a package?
If we had the package, we could just hit ‘paste to console’ (RWinEdt) or C-c C-c (ESS+Emacs) and let R estimate, simulate and send results to co-authors automatically. What a beautiful world!!
I use Matlab and STATA as well, but R completely owns me. Being a bad boy naturally, I have started to encourage newcomers to use R in my workplace.
-
ynte wrote:I happened to hit this page, and I am impressed by the pros and cons.
13. January 2010 at 8:30 pm :
Been using SPSS for over 30 years and I’ve been appreciating the steep increase in usability from punch card syntax to pull-down menus. I only ran into R today because it can handle Zero Inflated Poisson Regression and SPSS can’t or won’t.
I think it is great to find open source statistical software. I guess it requires a special mental framework to actually enjoy struggling through the command structure, but if I were 25 years younger………
It really is a bugger to find that SPSS (or whatever they like to be called) and R come up with different parameter estimates on the same dataset [at least in the negative binomial model I compared].
Is there anyone out there with experience in comparing two or more of these packages on one and the same dataset?
-
Wei wrote:@ynte
16. January 2010 at 9:58 am :
Why don’t you join the R mailing list? If you ask questions properly there, you will get answers.
I would suggest a place to start: http://www.r-project.org/mail.html
Have fun.
-
peng wrote:hi friends,
27. January 2010 at 10:22 am :
I am new to R. I would like to know about R-PLUS. Does anyone know where I can get free training for R-PLUS?
Regards,
Peng.
-
Wayne wrote:I use R.
12. February 2010 at 8:38 pm :
I’ve looked at Matlab, but the primitive nature of its language turns my stomach. (I mean, here’s a language that uses alternating strings and values to imitate named parameters? A language where it’s not unusual to have half a page of code in a routine dedicated to filling in parameters based on the number of supplied arguments?) And the Matlab culture seems to treat Perlesque obfuscation of code as a virtue. Plus it’s expensive. It’s really an engineer’s tool, not a statistician’s tool.
SAS creeps me out: it was obviously designed for punched cards and it’s an inconsistent mix of 1950’s and 1960’s languages and batch command systems. I’m sure it’s powerful, and from what I’ve read the other statistics packages actually bend their results to match SAS’s, even when SAS’s results are arguably not good. So it’s the Gold Standard of Statistics ™, literally, but it’s not flexible and won’t be comfortable for someone expecting a well-designed language.
R’s language has a good design that has aged well. But it’s definitely open source: you have two graphical languages that come in the box (base and lattice), with a third that’s a real contender (ggplot2). Which to choose? There are over 2,000 packages and it takes a bit of analysis just to decide which of the four Wavelet packages you want to use for your project — not just current features, but how well maintained the package appears to be, etc.
There are really three questions to answer here: 1) What field are you working in, 2) How focused are your needs, and 3) What’s your budget?
In engineering (and Machine Learning and Computer Vision), 95% of the example code you find in articles, online, and in repositories will be Matlab. I’ve done two graduate classes using R where Matlab was the “no brainer” choice, but I just can’t stomach Matlab “programming”. Python might’ve been a good choice as well, but with R I got an incredible range of graphics combined with a huge variety of statistical and learning techniques. You can get some of that in Python, but it’s really more of a general-purpose tool where you definitely have to roll your own.
-
[...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS,
SPSS, Stata – Brendan O… – Lukas and I were trying to write a succinct
comparison of the most popular packages that are typically used for data
analysis. I think most people choose one based on what people around them use or
what they learn in school, so I’ve found it hard to find comparative
information. I’m posting the table here in hopes of useful comments. [...]
-
Jay wrote:Yeah, quite the odd list. If *Py stuff is in there, then PDL definitely should be too.
17. February 2010 at 9:43 am :
-
[...] Comparison of data analysis packages from Brendan O’Connor [...]
-
stat_stuff wrote:i like what you wrote to describe spss, clear and concise… nuff said :-)
25. February 2010 at 10:24 am :
-
forkandwait wrote:I would like to comment on SAS versus R versus Matlab/ Octave.
27. February 2010 at 12:05 am :
SAS seems to excel at data handling, both with large datasets and with whacked proprietary formats (how else can you read a 60GB text file and merge it with an Access database from 1998?). It is really ugly though, not interactive/exploratory, and the graphics aren’t great.
R is awesome because it is a fully featured language (things like named parameters, object orientation, typing, etc.), and because every new data analysis algorithm probably gets implemented in it first these days. I rather like the graphics. However, it is a mess, with naming conventions that have evolved badly over time, conflicting types, etc.
Matlab is awesome in its niche, which is NOT data analysis, but rather math modeling with scripts between 10 and 1000 lines. It is really easy to get up and running if you have a math (i.e. linear algebra) background, the function file system is great for a medium level of software engineering, plotting is awesome and simpler than R, and the datatypes (structs) are complex enough without the headaches of a “well developed” type system. If you are doing data management, GUI interaction, or dealing with categorical data, it might be best to use SQL/SAS or something else and export your data into matrices of numbers.
I would like numpy and friends, but ZERO-BASED INDEXING IS NOT MATHEMATICAL.
Just my 2c
-
anlaystenheini wrote:This is a great compilation, thank you.
16. April 2010 at 4:52 pm :
After working as an econometrics analyst for a while, mainly using Stata, I can say the following about Stata:
Stata is relatively easy to get started with and to produce some graphics quickly (that’s what all the business people want: click click, here’s your PowerPoint presentation with lots of colourful graphics and no real content).
BUT if you want to automate things, and if you want to make Stata do things it isn’t capable of out of the box, it is pure pain!
The big problem is: on one hand, Stata has a scripting/command interface which is not very powerful and very, very inconsistent. On the other hand, Stata has a fully featured matrix-oriented programming language with C-like syntax, which, being C-like, is not very handy (C is old and not made for mathematics; the Matlab language is much more convenient), and which doesn’t work well with the rest of Stata (there is a superfluous extra level for interchanging data from one part to the other).
All together, programming Stata feels like persuading Stata:
Error messages are almost useless, the macro text expansion used in the scripting language is not very suitable for anything that has to do with mathematics (text can’t calculate), and many other little things.
It is very inconsistent, sometimes very clumsy to handle, and has silly limitations like string expressions limited to 254 chars, like in the early 20th century.
So go with Stata for a little ad hoc statistics, but do not use it for more sophisticated stuff; in that case, learn R!
-
George Wolfe wrote:I’ve used Mathematica as a general purpose programming language for the past couple of years. I’ve built a portfolio optimizer, various tools to manipulate data and databases, and a lot of statistics and graphing routines. People who use commercial portfolio optimizers are always surprised at how fast the Mathematica optimizations run - faster than their own optimizers. Based on my experience, I can say that Mathematica is great for numerical and ordinary computational tasks.
19. April 2010 at 11:13 pm :
I did have to spend a lot of time learning how to think in Mathematica - it’s most powerful when used as a functional language, and I was a procedural programmer. However, if you want to use a procedural programming approach, Mathematica supports that.
Regarding some of the other topics discussed above: (1) Mathematica has built-in support for parallel computing, and can be run on supercomputing clusters (Wolfram Alpha is written in Mathematica). (2) The language is highly evolved and is being actively extended and improved every year. It seems to be in an exponential phase of development currently - Stephen Wolfram outlines the development plans every year at the annual user conferences - and his expectations seem to be pretty much on target. (3) Wolfram has a stated goal of making Mathematica a universal computing platform which smoothly integrates theoretical and applied mathematics with general-purpose programming, graphics, and computation. I admit to a major case of hero worship, but I think he is achieving this goal.
I’m going on and on about Mathematica because, in spite of its wonderfulness, it doesn’t seem to have taken its rightful place in these discussions. Maybe Mathematica users drop out of the “what’s the best language for x” debate after they start using it. I don’t know, really. But anyway, that’s the way I see it.
-
Dale wrote:I am amazed that nobody has mentioned JMP. It is essentially equivalent to SPSS or Stata in capabilities but far easier to use (certainly to teach or learn). The main reason it is not so well known is that it is a SAS product, and they don’t want to market it well for fear that nobody will want SAS any more.
25. April 2010 at 12:54 am :
-
ad wrote:In the comparison I did not see Freemat, an open source tool that follows along the lines of MATLAB. It would be interesting to see how the community compares Freemat to Matlab.
25. April 2010 at 1:23 pm :
-
-
Farhat wrote:@Wolfe: I have used Mathematica a lot over the past 8 years and still use it for testing ideas, since small pieces of code can do fairly sophisticated stuff, but I’ve found it poor for large datasets and longer code development. It even lacked support for a code versioning system until recently. The cost is also a major detractor: Mathematica cost around $2500 last time I checked. Also, some of the newer features like Manipulate seem to create issues; I had a small piece of code using it for interactivity which sent CPU usage to 100% regardless of whether any change was happening or not.
27. April 2010 at 9:37 am :
Also, SAGE ( http://www.sagemath.org ), the open source alternative to Mathematica has gotten quite powerful in the last few years.
-
I just wanted to mention that Maple, which has not been commented on yet in
this post or in the subsequent thread, generates beautiful visuals and I used to
program in it all the time (as an alternative to Mathematica which was used by
the “other camp” and I wouldn’t touch).
Also, I’m starting to use Matlab now and loving how intuitive it is (for someone with programming experience, anyway).
-
Jason wrote:let me quote some of Ross Ihaka’s reflections on R’s efficiency….
9. May 2010 at 5:40 pm :
“I’m one of the two originators of R. After reading Jan’s
paper I wrote to him and said I thought it was interesting
that he was choosing to jump from Lisp to R at the same
time I was jumping from R to Common Lisp……
We started work on R in the early ’90s. At the time
decent Lisp implementations required much more resources
than our target machines had. We therefore wrote a small
scheme-like interpreter and implemented over that.
Being rank amateurs we didn’t do a great job of the
implementation and the semantics of the S language which
we borrowed also don’t lead to efficiency (there is a
lot of copying of big objects).
R is now being applied to much bigger problems than we
ever anticipated and efficiency is a real issue. What
we’re looking at now is implementing a thin syntax over
Common Lisp. The reason for this is that while Lisp is
great for programming it is not good for carrying out
interactive data analysis. That requires a mindset better
expressed by standard math notation. We do plan to make
the syntax thin enough that it is possible to still work
at the Lisp level. (I believe that the use of Lisp syntax
was partially responsible for why XLispStat failed to gain
a large user community).
The payoff (we hope) will be much greater flexibility and
a big boost in performance (we are working with SBCL so
we gain from compilation). For some simple calculations
we are seeing orders of magnitude increases in performance
over R, and quite big gains over Python…..”
the full post is here:
http://r.789695.n4.nabble.com/Ross-Ihaka-s-reflections-on-Common-Lisp-and-R-td920197.html#a920197
it is quite interesting to note that such a “provocative” post from one of R’s originators got zero response on the R-devel list………..
-
Business Intelligence Tools: looking at R as a platform for big BI. - SkriptFounders wrote:[...] is some more information I thought was nice on the best packages for stat analysis. The only thing thats wrong here is the [...]
23. May 2010 at 5:36 am :
-
Sam wrote:I came across this thread and I’m finding the comments very useful. Thanks to all!
16. June 2010 at 4:12 pm :
I’m trying to decide which software package to use. I’m a researcher working with clinical (patient-related) data. I have data sets with <10,000 rows (usually just a few thousand). I need software that will generate multivariate and logistic regression, and Kaplan-Meier survival curves. Visualization is very important.
Of note, I’m an avid programmer as a hobby (C++, assembly, most anything), so I’m very comfortable with a more complex package, but I need something that just works. I’ve been using SPSS, which works but is clunky.
Any suggestions? Stata? Systat? S-Plus? Maple?
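(Since Sam mentions being comfortable with code: a sketch of what those two analyses look like in Python, one more option alongside the packages he lists. statsmodels and lifelines are my suggestions, not anything from the thread, and the file and column names are hypothetical.)

    # Sketch: multivariate logistic regression and a Kaplan-Meier curve.
    # 'patients.csv' and its columns are made-up placeholders.
    import pandas as pd
    import statsmodels.api as sm
    from lifelines import KaplanMeierFitter

    df = pd.read_csv("patients.csv")

    # logistic regression: outcome 'died' on covariates 'age' and 'stage'
    X = sm.add_constant(df[["age", "stage"]])
    fit = sm.Logit(df["died"], X).fit()
    print(fit.summary())

    # Kaplan-Meier survival curve from follow-up time and event indicator
    kmf = KaplanMeierFitter()
    kmf.fit(df["time"], event_observed=df["event"])
    kmf.plot()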
-
I still haven’t used Stata, but its users have very strong praise for it, for
situations that sound like yours. That might be the best option to start
with.
R might be worth trying too.
-
Rashad wrote:I am working on my undergraduate degree in statistics in the SAS direction, which has surprised people in the field I meet. The choice was somewhat arbitrary; I just wanted something applied to complement a pure mathematics degree. This post has opened many (…..many) options to consider. Thanks for the great discussion.
5. July 2010 at 12:48 am :
-
Donovan wrote:So my question here is simple:
2. August 2010 at 2:44 am :
After you peel back all the layers and look for the solution that would require the least effort, with the most power and the greatest flexibility, why would anyone choose anything other than RPy first, with the language du jour your employer uses as a backup, and scrap the code war?
I mean, for my money, you make sure you can build a model in Excel, learn RPy & C#, and search for APIs if you need to use other languages, or just plain partner with someone who can code C++ {if you can’t} and simply inject it.
I mean, I plan on learning Java, PHP and SAS as well, but that is really a personal choice. Coming from IT within Finance, not knowing Java and SAS means you either won’t get in the door or will reach a glass ceiling pretty quickly unless you play corporate politics really, really well. So for me, it is a necessity. But the flip side is, wanting to make the leap into Financial Engineering after completing a doctorate in Engineering, RPy has also become a near necessity. Realistically, unless you just like coding, I have to say that what I have suggested makes the most sense for the average analysis pro. But then a lot of this is based upon whether you’re a Quant Researcher, Quant Developer, Analyst, etc. — different tools for different functions.
Just a thought.
-
Mark Smith wrote:SAS and R:
14. August 2010 at 11:06 pm :
1. there is a book out on the topic (http://www.amazon.com/gp/product/1420070576?ie=UTF8&tag=sasandrblog-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1420070576)
2. R interface available in SAS 9.2
“While SAS is committed to providing the new statistical methodologies that the marketplace demands and will deliver new work more quickly with a recent decoupling of the analytical product releases from Base SAS, a commercial software vendor can only put out new work so fast. And never as fast as a professor and a grad student writing an academic implementation of brand-new methodology.
Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers.
“We know a lot of our users have both R and SAS in their toolkit, and we decided to make it easier for them to access R by making it available in the SAS 9.2 environment,” said Rodriguez.
The SAS/IML Studio interface allows you to integrate R functionality with SAS/IML or SAS programs. You can also exchange data between SAS and R as data sets or matrices.
“This is just the first step,” said Radhika Kulkarni, Vice President of Advanced Analytics. “We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. In the future, users will be able to interface with R through the IML procedure.”
http://support.sas.com/rnd/app/studio/Rinterface2.html
While this is probably more for SAS users than R, I thought both camps might be interested in case you get coerced into using SAS one day… doesn’t mean you have to give up your experience with R.
-
Iskander wrote:I am also amazed at how few people here have said anything about StatSoft Statistica. I’ve been using it for close to 6 years and don’t see any shortcomings at all. Consider this:
26. August 2010 at 4:51 pm :
- full support of R
- fully scriptable, which means you can call DLLs written in any programming language, implementing things you didn’t find built into Statistica (which doesn’t mean they’re not there)
- the Statistica solver / engine can be called externally from Excel and other applications via the COM/OLE interface
- untrammelled graphics of virtually any complexity — extremely flexible and customizable (and scriptable)
- the Data Miner (with its brand new ‘Data Miner Recipes’) is another extremely powerful tool that leaves only your imagination to limit you
….it would be tedious to list all its advantages (again, the Statistica Neural Networks and the Six Sigma modules are IMO very professionally implemented).
-
ZZ wrote:Since SAS bought Teragram a few years ago, no package other than SAS can load unstructured data like the blog comments posted here, analyze it, and extract the sentiment (positive, negative, neutral) about each of the packages debated here with pretty decent precision.
31. August 2010 at 12:08 pm :
-
[...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS,
SPSS, Stata – Brendan O… Excellent comparison between data analysis
packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata. (tags: python r matlab)
[...]
-
Interesting Comparison of data analysis packages - CCPR Computing wrote:[...] http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-st... Uncategorized none [...]
23. September 2010 at 10:31 pm :
-
[...] 1. Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS,
SPSS, Stata [...]
-
John wrote:A post above commented: “sas is handy you have some old punch cards in the cupboard or a huge dataset. apart from that it truly sucks. some people say that it is good to manage data, but why not use a good relational database to do that and then use decent statistical software to do the analysis?” A good relational database is good at supporting online transaction processing, and in most organizations it will come with a bureaucracy of gatekeepers whose role is to ensure the integrity of the database supporting mission-critical transactional applications. In other words, it takes a mountain of paperwork merely to add one field to a table. The paradigm assumes a business area of ‘users’ who have their requirements spelled out before anyone even thinks of designing, let alone programming, anything. It just kills analysis. Where SAS is used, data must be extracted from such systems and loaded into text files for SAS to read, or SAS/Access must be used. Generally DBAs are loath to install the latter, as it is difficult to optimize in the sense of minimizing the drain on operational systems.
17. October 2010 at 4:40 am :
On IBM mainframes the choice of languages is limited, and by default the choice will usually be SAS. Most large organisations have SAS, at least Base SAS, installed by default because the Merrill MXG capacity-planning software uses it; hence cost is sort of irrelevant. It then tends to be used for anything requiring processing of text files, even in production applications, and this often means processing text as text, e.g. JCL with date-dependent parameters, rather than preparing data for loading into SAS datasets for statistical analysis.
I know nothing about R, but seeing a few code samples it struck me how much it resembles APL, to which we were introduced in our stats course in college in the early 70s; not surprising, as both are matrix oriented.
Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata
Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis. I think most people choose one based on what people around them use or what they learn in school, so I’ve found it hard to find comparative information. I’m posting the table here in hopes of useful comments.
| Name | Advantages | Disadvantages | Open source? | Typical users |
| R | Library support; visualization | Steep learning curve | Yes | Finance; Statistics |
| Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering |
| SciPy/NumPy/Matplotlib | Python (general-purpose programming language) | Immature | Yes | Engineering |
| Excel | Easy; visual; flexible | Large datasets | No | Business |
| SAS | Large datasets | Expensive; outdated programming language | No | Business; Government |
| Stata | Easy statistical analysis | | No | Science |
| SPSS | Like Stata but more expensive and worse | | No | Science |
There’s a bunch more to be said for every cell. Among other things:
- Two big divisions on the table: The more programming-oriented solutions are R, Matlab, and Python. More analytic solutions are Excel, SAS, Stata, and SPSS.
- Python “immature”: matplotlib, numpy, and scipy are all separate libraries that don’t always get along. Why does matplotlib come with “pylab”, which is supposed to be a unified namespace for everything? Isn’t scipy supposed to do that? Why is there duplication between numpy and scipy (e.g. numpy.linalg vs. scipy.linalg; see the short sketch after this list)? And then there’s package compatibility version hell. You can use SAGE or Enthought but neither is standard (yet). In terms of functionality and approach, SciPy is closest to Matlab, but it feels much less mature.
- Matlab’s language is certainly weak. It sometimes doesn’t seem to be much more than a scripting language wrapping the matrix libraries. Python is clearly better on most counts. R’s is surprisingly good (Scheme-derived, smart use of named args, etc.) if you can get past the bizarre language constructs and weird functions in the standard library. Everyone says SAS is very bad.
- Matlab is the best for developing new mathematical algorithms. Very popular in machine learning.
- I’ve never used the Matlab Statistical Toolbox. I’m wondering, how good is it compared to R?
- Here’s an interesting reddit thread on SAS/Stata vs R.
- SPSS and Stata in the same category: they seem to have a similar role so we threw them together. Stata is a lot cheaper than SPSS, people usually seem to like it, and it seems popular for introductory courses. I personally haven’t used either…
- SPSS and Stata for “Science”: we’ve seen biologists and social scientists use lots of Stata and SPSS. My impression is they get used by people who want the easiest way possible to do the sort of standard statistical analyses that are very orthodox in many academic disciplines. (ANOVA, multiple regressions, t- and chi-squared significance tests, etc.) Certain types of scientists, like physicists, computer scientists, and statisticians, often do weirder stuff that doesn’t fit into these traditional methods.
- Another important thing about SAS, from my perspective at least, is that it’s used mostly by an older crowd. I know dozens of people under 30 doing statistical stuff and only one knows SAS. At that R meetup last week, Jim Porzak asked the audience if there were any recent grad students who had learned R in school. Many hands went up. Then he asked if SAS was even offered as an option. All hands went down. There were boatloads of SAS representatives at that conference and they sure didn’t seem to be on the leading edge.
- But: is there ANY package besides SAS that can do analysis on datasets that don’t fit into memory? That is, ones that mostly have to stay on disk? And exactly how good are SAS’s capabilities here anyway?
- If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work. There are a few multi-machine data processing frameworks that are somewhat standard (e.g. Hadoop, MPI), but it’s an open question what the standard distributed data analysis framework will be. (Hive? Pig? Or quite possibly something else.)
- (This was an interesting point at the R meetup. Porzak was talking about how going to MySQL gets around R’s in-memory limitations. But Itamar Rosenn and Bo Cowgill (Facebook and Google respectively) were talking about multi-machine datasets that require cluster computation that R doesn’t come close to touching, at least right now. It’s just a whole different ballgame with that large a dataset.)
- SAS people complain about poor graphing capabilities.
- R vs. Matlab visualization support is controversial. One view I’ve heard is, R’s visualizations are great for exploratory analysis, but you want something else for very high-quality graphs. Matlab’s interactive plots are super nice though. Matplotlib follows the Matlab model, which is fine, but is uglier than either IMO.
- Excel has a far, far larger user base than any of these other options. That’s important to know. I think it’s underrated by computer scientist sort of people. But it does massively break down at >10k or certainly >100k rows.
- Another option: Fortran and C/C++. They are super fast and memory efficient, but tricky and error-prone to code, have to spend lots of time mucking around with I/O, and have zero visualization and data management support. Most of the packages listed above run Fortran numeric libraries for the heavy lifting.
- Another option: Mathematica. I get the impression it’s more for theoretical math, not data analysis. Can anyone prove me wrong?
- Another option: the pre-baked data mining packages. The open-source ones I know of are Weka and Orange. I hear there are zillions of commercial ones too. Jerome Friedman, a big statistical learning guy, has an interesting complaint that they should focus more on traditional things like significance tests and experimental design. (Here; the article that inspired this rant.)
- I think knowing where the typical users come from is very informative for what you can expect to see in the software’s capabilities and user community. I’d love more information on this for all these options.
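(The numpy/scipy duplication called out in the Python bullet above, in code; a minimal sketch using nothing beyond the two public linalg namespaces.)

    # Both numpy and scipy expose overlapping linear-algebra routines.
    import numpy as np
    import scipy.linalg

    a = np.random.rand(4, 4)

    # The same SVD lives in two namespaces:
    u1, s1, vt1 = np.linalg.svd(a)
    u2, s2, vt2 = scipy.linalg.svd(a)
    assert np.allclose(s1, s2)

    # scipy.linalg is the superset: an LU decomposition, for example,
    # is only available there, not in numpy.linalg.
    p, l, u = scipy.linalg.lu(a)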
114 comments to “Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata”
-
Eric Sun wrote:>>I know dozens of people under 30 doing statistical stuff and only one knows SAS.
23. February 2009 at 8:53 pm :
I’m assuming the “one” is me, so I’ll just say a few points:
I’m taking John Chambers’s R class at Stanford this quarter, so I’m slowly and steadily becoming an R convert.
That said, I don’t think anything besides SAS can do well with datasets that don’t fit in memory. We used SAS in litigation consulting because we frequently had datasets in the 1-20 GB range (i.e. can fit easily on one hard disk but difficult to work with in R/Stata where you have to load it all in at once) and almost never larger than 20GB. In this relatively narrow context, it makes a lot of sense to use SAS: it’s very efficient and easy to get summary statistics, look at a few observations here and there, and do lots of different kinds of analyses. I recall a Cournot Equilibrium-finding simulation that we wrote using the SAS macro language, which would be quite difficult in R, I think. I don’t have quantitative stats on SAS’s capabilities, but I would certainly not think twice about importing a 20 GB file into SAS and working with it in the same way as I would a 20 MB file.
That said, if you have really huge internet-scale data that won’t fit on one hard drive, then SAS won’t be too useful either. I’ll be very interested if this R + Hadoop system ever becomes mature: http://www.stat.purdue.edu/~sguha/rhipe/
In my work at Facebook, Python + RPy2 is a good solution for large datasets that don’t need to be loaded into memory all at once (for example, analyzing one facebook network at a time). If you have multiple machines, these computations can be sped up using iPython’s parallel computing facilities.
Also, R’s graphical capabilities continue to surprise me; you can actually do a lot of advanced stuff. I don’t do much graphics, but perhaps check out “R Graphics” by Murrell or Deepayan Sarkar’s book on Lattice Graphics.
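(A minimal sketch of the Python + RPy2 round trip Eric describes: push a chunk into R, run an R routine on it, pull the result back. The loop and the chunks are hypothetical stand-ins; this assumes the rpy2 robjects API.)

    # Push each data chunk into R, compute there, and read the result back.
    import rpy2.robjects as robjects
    from rpy2.robjects.vectors import FloatVector

    r = robjects.r
    for chunk in ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]):  # stand-in chunks
        robjects.globalenv["x"] = FloatVector(chunk)
        print(r("mean(x)")[0])   # any R computation could go here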
-
Eric Sun wrote:I thought most people consider SAS to have the steepest learning curve, certainly steeper than R’s, but maybe I’m mistaken about that.
23. February 2009 at 8:55 pm :
-
Justin wrote:Calling scipy immature sounds somehow “wrong”. The issues you bring up are more early design flaws that will not go away, no matter how “mature” scipy gets.
23. February 2009 at 10:24 pm :
That said, these are flaws, but they seem pretty minor to me.
-
I’ve recently seen GNU DAP mentioned as an open-source equivalent to SAS.
Know if it’s any good?
-
TS Waterman wrote:Have you considered Octave in this regard? It’s a GNU-licensed Matlab clone. Very nice graphing capability, Matlab syntax and library functions, open source.
23. February 2009 at 10:49 pm :
http://www.gnu.org/software/octave/FAQ.html#MATLAB-compatibility
-
@Eric - oops, yeah, should’ve put SAS as hardest. Good point that the standard for judging how good large-dataset support is, is whether you can manipulate a big dataset the same way you manipulate a small one. I’ve loaded 1-2 GB of data into R and you definitely have to do things differently (e.g. never use by()).
@Justin - scipy certainly seems like it keeps improving. I just keep comparing it to matlab and it’s constantly behind. I remember once watching someone try to make a 3d plot. He spent quite a while going through various half-baked python solutions that didn’t work. Then he booted up matlab and had one in less than a minute. Matlab’s functionality is well-designed, well-put-together and well-documented.
@Edward - I have seen it mentioned too. From glancing at its home page, it seems like a pretty small-time project.
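(On the 3d-plot anecdote just above: matplotlib does ship a 3d toolkit, mplot3d; a minimal surface-plot sketch with made-up data.)

    # Minimal 3d surface plot via matplotlib's mplot3d toolkit.
    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

    x = np.linspace(-2, 2, 50)
    X, Y = np.meshgrid(x, x)
    Z = np.exp(-(X**2 + Y**2))        # a Gaussian bump as stand-in data

    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.plot_surface(X, Y, Z)
    plt.show()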
-
@TS - yeah, i used octave just once for something simple. it worked fine. my
issues were: first, i’m not impressed with gnuplot graphing. second, the
interactive environment isn’t too great. third, trying to clone the matlab
language seems crazy since it’s kind of crappy. i think i’d usually pick scipy
over octave if being free is a requirement, else go with matlab if i have access
to it.
otoh it looks like it supports some nice things like sparse matrices that i’ve had a hard time with lately in R and scipy. i guess worth another look at some point…
-
Brendan,
Nice overview, I think another dimension you don’t mention — but which Bo Cowgill alluded to at our R panel talk — is performance. Matlab is typically stronger in this vein, but R has made significant progress with more recent versions. Some benchmark results can be found at:
http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php
MD
-
Mike wrote:In high energy particle physics, ROOT is the package of choice. It’s distributed by CERN, but it’s open source, and is multi-platform (though the Linux flavor is best supported). It does solve some of the problems you mentioned, like running over large datasets that can’t be entirely memory-resident. The syntax is C++ based, and has both an interpreter and the ability to compile/execute scripts from the command line.
23. February 2009 at 11:27 pm :
There are lots of reasons to prefer other packages (like R) over ROOT for certain tasks, but in the end there’s little that can be done with other packages that one cannot do with ROOT.
[7/09 update: tweaks incorporating some of the excellent comments below, esp. for SAS, SPSS, and Stata.]
SPSS predictive analytics software and solutions
Predictive analytics helps your organization anticipate change so that you can plan and carry out strategies that improve outcomes. By applying predictive analytics solutions to data you already have, your organization can uncover unexpected patterns and associations and develop models to guide front-line interactions. This means you can prevent high-value customers from leaving, sell additional services to current customers, develop successful products more efficiently, or identify and minimize fraud and risk. Predictive analytics gives you the knowledge to predict…and the power to act.
Learn more about IBM® SPSS® software
- IBM SPSS Statistics puts the power of advanced statistical analysis in your hands.
- With IBM SPSS Modeler, you can quickly discover patterns and trends in your data, using a unique visual interface supported by advanced analytics.
- Get an accurate view of people's attitudes, preferences, and opinions with IBM SPSS Data Collection.
- Use IBM SPSS Deployment products (link resides outside of ibm.com) to drive high-impact decisions by making analytics a vital part of your business.
An Objective Evaluation of SPSS and SAS
(reposted from a lecture by Professor Xu Tianhe on DXY / dingxiangyuan)
Introduction to SAS
SAS at a glance
- Vendor: SAS Institute
- Latest version: 9.0 (English), 8.2 (Chinese)
- Size: over 1 GB (full installation)
- English site: http://www.sas.com
- Chinese site: http://www.sas.com/offices/asiapacific/china
The name and its meaning
- SAS stands for Statistical Analysis System. It was first written by two biostatistics graduate students at the University of North Carolina; the SAS Institute was founded in 1976 and formally released the SAS software.
- SAS is a large integrated information system for decision support, but its earliest functionality was limited to statistical analysis, and statistical analysis remains an important component and core capability to this day.
SAS overview (1)
- After years of development, SAS has been adopted by nearly thirty thousand institutions in more than 120 countries and regions, with over three million direct users across finance, healthcare, manufacturing, transportation, telecommunications, government, education and research.
- In countries such as the US and the UK, proficiency in statistical analysis with SAS is one of the selection criteria at many companies and research institutions.
SAS overview (2)
- In data processing and statistical analysis, the SAS system is hailed as the international standard software system, and in 1996-97 it was voted the product of choice for building databases. It can fairly be called the giant of statistical software.
- For example, in the famously strict US FDA new drug approval process, the statistical analysis of trial results is stipulated to be done in SAS only; results computed with other software are not accepted, not even a simple mean and standard deviation! This shows SAS's authoritative position.
Characteristics of SAS (1)
- SAS evolved from mainframe systems, so its core mode of operation is program-driven. Over many years it has grown into a complete computer language, and its user interface fully reflects this: it uses an MDI (multiple document interface) in which the user enters a program in the PGM window and the analysis results are output as text in the OUTPUT window.
Characteristics of SAS (2)
- Working through programs, a user can accomplish all the work that needs doing, including statistical analysis, forecasting, modeling and simulated sampling.
- However, this means beginners must learn the SAS language before using SAS, so getting started is relatively difficult.
Characteristics of SAS (3)
- The Windows versions of SAS provide several graphical interfaces developed for different user groups. Each has its own strengths and is very convenient to use, but because little Chinese-language literature introduces them and they are not a focus of SAS's marketing, they remain unknown to most people.
Drawbacks of SAS (1)
- Because the SAS system grew out of mainframe systems and was designed entirely for professional users, it is still operated mainly by programming, its interactive interface is not very friendly, and programming it works best when the user clearly understands the statistical methods being used; it is therefore relatively hard for non-statisticians to master.
Drawbacks of SAS (2)
- In addition, SAS's extremely high price and lease-only sales strategy mean that individuals and institutions without sufficient means can only admire it from afar.
Introduction to SPSS
SPSS at a glance
- Vendor: SPSS Inc.
- Latest version: 11.5
- Size: 200 MB (full installation)
- English site: http://www.spss.com
- Chinese site: http://www.spss.com.cn
The name and its meaning
- SPSS is an acronym of the software's English name, originally Statistical Package for the Social Sciences.
- As the range and depth of SPSS's products and services grew, the company officially changed the full name in 2000 to Statistical Product and Service Solutions, marking a major adjustment of SPSS's strategic direction.
Characteristics of SPSS (1)
- SPSS's most prominent features:
- data managed through windows;
- analysis methods presented through menus;
- options presented through dialog boxes.
Characteristics of SPSS (2)
- With a reasonable command of Windows and a rough grasp of statistical principles, anyone can put the software to work on a specific research task.
- The interface is extremely friendly and the output is attractive (at least to Western eyes); it is the statistical software of choice for non-professional statisticians.
Characteristics of SPSS (3)
- SPSS uses an Excel-like spreadsheet for entering and managing data, has fairly universal data interfaces, and can conveniently read data from other databases. Its statistical procedures cover the common, well-established methods, which fully meets the needs of non-statisticians.
Characteristics of SPSS (4)
- For users familiar with the program-driven operation of older versions, SPSS provides a syntax window: select the options in the menus, press the Paste button, and standard SPSS syntax is generated automatically. This is a great convenience for intermediate and advanced users.
Drawbacks of SPSS (1)
- Within SPSS Inc.'s product line, the SPSS package is a mid- to low-end product (the company has more than twenty products), so from a strategic standpoint SPSS clearly devotes much of its effort to the user interface.
Drawbacks of SPSS (2)
- The software only incorporates relatively mature statistical methods. For the newest methods, SPSS Inc.'s approach is to develop separate dedicated products, such as Answer Tree for tree-structured models, Neural Connection for neural networks, and Clementine for data mining, rather than fold them into SPSS, so they are nowhere to be found in SPSS itself.
Drawbacks of SPSS (3)
- Also, although the output looks nice, it cannot be opened directly by Word and other common word processors; it can only be exchanged by copy and paste. These could be called SPSS's fatal flaws.
SAS — Programming — Mathematics
What is distinctive about SAS programming, how does it relate to and differ from programming in other languages, and why is it easy to learn, easy to use, and stable?
The aim of the SAS software is to give everyone who needs to do data processing and data analysis, whether or not they work in computing, a software system that is easy to learn and use, complete, and reliable. The SAS language itself is a non-procedural (fourth-generation) language, similar to C, that combines the capabilities of various high-level languages with flexible formats and fuses data processing and statistical analysis into one.
The SAS system therefore has its own distinctive programming steps and language. Its greatest strengths are simplicity, ease of learning, and sharply targeted statements: relying only on the DATA step and the PROC step, flexibly combined, you can solve any problem, simple or complex, in reading, processing, analyzing, presenting and linking data.
SAS's programming language is its most admirable part: it only grows richer, rather than becoming dated or obsolete the way some computer languages do, because SAS updates its supporting implementation language as computer languages evolve. For example, V8 is built on C++, and the new V9 will have a Java-backed version, supporting the whole process of connecting to, reading, processing, analyzing and presenting data over the network. SAS programming experience therefore keeps accumulating, and you need not exhaust yourself chasing each generation of computer languages.
Am I suited to learning SAS? Does learning SAS require advanced mathematics and statistics?
This is a question many people care about. In reality, people of any professional background can learn and master SAS, and once learned, it serves you for life.
Quite a few people hear that SAS is a statistical analysis system and assume that only those versed in advanced mathematics and statistics can learn it; this is a misconception. From its founding, SAS's focus has been the data processing that accounts for 80% of the workload before any statistical analysis. Once the data is ready, running the corresponding analysis module is almost as simple as using a point-and-shoot camera; anyone can operate it. If what you need is an improvement or breakthrough in statistical methodology, that is no longer routine statistical analysis but research into statistical methods.
This article comes from a CSDN blog; please credit the source when reposting: http://blog.csdn.net/cRyIng_gG/archive/2007/08/19/1750251.aspx
What is managed code?
In the “old days” (only a few years ago), developers writing code in C and C++ had to handle memory management themselves. When allocated memory was no longer needed it had to be released, unless you wanted that memory to be “leaked”, and memory leaks cause serious performance problems. Worse still, because you work with pointers directly, it is easy to corrupt memory the project is using. In many cases this leads to very long debugging sessions, because the place where the error finally shows up is usually not the place where the memory was first corrupted.
C and C++ are considered hard to master largely because of this class of problems. Many developers are unwilling to try C and C++ for the same reason, and instead turn to other high-level languages without these headaches, such as Visual Basic. Although these newer languages are easy to learn and use, they also have drawbacks: their performance cannot compare with C and C++, and in most cases they are notably slow. Moreover, because the underlying operating system is written in C++, these languages have difficulty exposing all of C++'s capabilities. They can be used for a great deal of very good work, but if you want the full performance and advantages of the operating system, you are on your own.
Compared with the first version of the .NET runtime, most of .NET has changed. Microsoft has redesigned the API almost completely, striving to ensure that developers' concerns are addressed. The new runtime had to be easy to learn and use, fast and efficient, and free of the memory-management headaches. This book shows the benefits .NET brings in these areas.
Managed code: code executed by the common language runtime environment rather than directly by the operating system. Managed-code applications receive common language runtime services such as automatic garbage collection, runtime type checking and security support. These services help provide uniform managed-application behavior that is independent of platform and language. C# is an example.
Unmanaged code: code executed directly by the operating system, outside the common language runtime environment. Unmanaged code must provide its own garbage collection, type checking, security support and other services; unlike managed code, it does not obtain them from the common language runtime. C++ and C are examples.
This article comes from a CSDN blog; please credit the source when reposting: http://blog.csdn.net/dadalan/archive/2008/12/05/3443466.aspx
24. February 2009 at 12:32 am :
Python actually really shines above the others for handling large datasets using memmap files or a distributed computing approach. R obviously has a stronger statistics user base and more complete libraries in that area - along with better “out-of-the-box” visualizations. Also, some of the benefits overlap - using numpy/scipy you get that same elegant matrix support / syntax that matlab has, basically slicing arrays and wrapping lapack.
The advantages of having a real programming language and all the additional non-statistical libraries & frameworks available to you make Python the language of choice for me. If there is something scipy is weak at that I need, I’ll also use R in a pinch or move down to C. I think you are basically operating at a disadvantage if you are using the other packages at this point. The only other reason I can see to use them is if you have no choice, for example if you inherited a ton of legacy code within your organization.
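(A minimal sketch of the memmap approach mentioned above: stream over a large on-disk array without ever loading it whole. The file name, dtype and shape are made up for illustration.)

    # Column means of a big on-disk matrix, one million-row slice at a time.
    import numpy as np

    n_rows, n_cols = 10_000_000, 20
    data = np.memmap("big_matrix.dat", dtype="float64",
                     mode="r", shape=(n_rows, n_cols))

    chunk = 1_000_000
    total = np.zeros(n_cols)
    for start in range(0, n_rows, chunk):
        total += data[start:start + chunk].sum(axis=0)  # only this slice in RAM
    col_means = total / n_rows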
24. February 2009 at 3:09 am :
As for Mathematica, I haven’t used it for statistics beyond some basic support for common distributions. But one thing it does well is very consistent syntax. I used it when it first came out, then didn’t use it for years, and then started using it again. When I came back to it, I was able to pick it up right where I left off. I can’t put R down for a week and remember the syntax. Mathematica may not do everything, but what it does do, it does elegantly.
24. February 2009 at 6:34 am :
@Mike - ah yes, i remember looking at ROOT a long time ago and thinking it was impressive. But then I forgot about it because all the cs/stats people whose stuff I usually read don’t know about it. I think it just goes to show that the data analysis tools problem is tackled so differently by different groups of people that it’s very easy to miss out on better options just due to lack of information!
@Pete - yeah I whine about python. but I seem to use numpy plenty still :) actually its freeness is a huge win over matlab for cluster environments since you don’t have to pay for a zillion licenses…
Hm I seem to be talking myself into thinking it’s down to R vs Python vs Matlab. then the rosetta stone http://mathesaurus.sourceforge.net/matlab-python-xref.pdf should be my guide…
@John - very interesting. I think many R users have had the experience of quickly forgetting how to do basic things.
24. February 2009 at 9:33 am :
> Nice comparison. I would add to the pros of R/Python that the data
> structures are much richer than Matlab. The big pro of Matlab still
> seems to be performance (and maybe the GUI for some people). On top of
> being expensive Matlab is a nightmare if you want to run a program on
> lots of nodes because you need a license for every node!
>
> It’s 2008b I did the comparison with - I should mention that!
24. February 2009 at 2:45 pm :
24. February 2009 at 9:27 pm :
Re David Knowles’ comment…
There are specialized parallel/distributed computing tools available from MathWorks for writing large-scale applications (for clusters, grid etc.). You should check out: http://www.mathworks.com/products/parallel-computing.
Running full-fledged desktop MATLAB on a huge number of nodes is messy and of course very expensive not to mention that a single user would take away several licenses for which other users will have to wait.
Disclosure: I work for the parallel computing team at The MathWorks
25. February 2009 at 12:27 am :
On Tue, Feb 24, 2009 at 7:20 AM, Scott Hirsch
>> Brendan –
>>
>> Thanks for the interesting discussion you got rolling on several popular
>> data analysis packages
[...]
>> I’m always very interested to hear the perspectives of MATLAB users, and
>> appreciate your comments about what you like and what you don’t like. I was
>> interested in following up on this comment:
>>
>> “Matlab’s language is certainly weak. It sometimes doesn’t seem to be
>> much more than a scripting language wrapping the matrix libraries. “
>>
>> I have my own assumptions about what you might mean, but I’d be very
>> interested in hearing your perspectives here. I would greatly appreciate it
>> if you could share your thoughts on this subject.
>
> sure. most of my experiences are with matlab 6. just briefly,
>
> * leave out semicolon => print the expression. that is insane.
> * each function has to be defined in its own file
> * no optional arguments
> * no named arguments
> * no way to group variables together in a structure. (i don’t need object
> orientation, just a bunch of named items)
> * no perl/python-style hashes
> * no object orientation (or just a message dispatch system) … less
> important
> * poor/no support for text
> * or other things a general purpose language knows how to do (sql, networks,
> etc etc)
On Tue, Feb 24, 2009 at 11:27 AM, Scott Hirsch
> Thanks, Brendan. This is very helpful. Some of the things have been
> addressed, but not all. Here are some quick notes on where we are today.
> Just to be clear – I have no intention (or interest) in changing your
> perspectives, just figured I could let you know in case you were curious.
>
>
>
> > * leave out semicolon => print the expression. that is insane.
> No plans to change this. Our solution is a bit indirect, but doesn’t break
> the behavior that lots of users have come to expect. We have a code
> analysis tool (M-Lint) that will point out missing semi-colons, either while
> you are editing a file, or in a batch process for all files in a directory.
>
> > * each function has to be defined in its own file
> You can include multiple functions in a file, but it introduces unique
> semantics – primarily that the scope of these functions is limited to within
> the file.
[[ addendum from me: yeah, exactly. if you want to make functions that are
shared in different pieces of your code, you usually have to do 1 function per
file. ]]
> > * no optional arguments
> Nothing yet.
>
> > * no named arguments
> Nope.
>
> > * no way to group variables together in a structure. (i don’t need object
> orientation, just a bunch of named items)
> We’ve had structures since MATLAB 5.
[[ addendum from me: well, structures aren't very conventional in standard
matlab style, or at least certainly not the standard library. most algorithm
functions return a tuple of variables, instead of packaging things together
into a structure. ]]
> > * no perl/python-style hashes
> We just added a Map container last year.
>
> > * no object orientation (or just a message dispatch system) … less
> important
> We had very weak OO capabilities in MATLAB 6, but introduced a modern system
> in R2008a.
>
> > * poor/no support for text
> This has gotten a bit better, primarily through the introduction of regular
> expressions, but can still be awkward.
>
> > * or other things a general purpose language knows how to do (sql, networks,
> etc etc)
> Not much here, other than a smattering (Database Toolbox for SQL,
> miscellaneous commands for web interaction, WSDL, …)
>
> Thanks again. I really do appreciate getting your perspective. It’s
> helpful for me to understand how MATLAB is perceived.
>
> -scott
25. February 2009 at 12:38 am :
25. February 2009 at 11:30 am :
http://www.scipy.org/NumPy_for_Matlab_Users
As of the last year, a standard ipython install ( “easy_install IPython[kernel]” ) now includes parallel computing right out of the box, no licenses required:
http://ipython.scipy.org/doc/rel-0.9.1/html/parallel/index.html
If this is going to turn into a performance shootout, then I’ll add that from what I’ve seen Python with numpy/scipy outperforms Matlab for vectorized code.
My impression has been that performance order is Numpy > Matlab > R, but as my friend Mike Salib used to say - “All benchmarks are lies”. Anyway, competition is good and discussions like this keep everyone thinking about how to improve their platforms.
Also, keep in mind that performance is often a sticking point for people when it need not be. One of the things I’ve found with dynamically typed languages is that ease of use often trumps raw performance - and you can always move the intensive stuff down to a lower level.
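(The vectorization point in numbers: a minimal, self-contained timing sketch. The exact ratio will vary by machine, and the function names are mine.)

    # A Python-level loop vs. the same computation pushed into numpy's
    # compiled internals.
    import numpy as np
    import timeit

    x = np.random.rand(1_000_000)

    def loop_sum_sq():
        s = 0.0
        for v in x:
            s += v * v
        return s

    def vec_sum_sq():
        return float(np.dot(x, x))

    print(timeit.timeit(loop_sum_sq, number=3))
    print(timeit.timeit(vec_sum_sq, number=3))  # typically far faster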
For people who like poking at numbers:
http://www.scipy.org/PerformancePython
http://www.mail-archive.com/numpy-discussion@scipy.org/msg14685.html
http://www.mail-archive.com/numpy-discussion@scipy.org/msg01282.html
Sturla has some strong points here:
http://www.mail-archive.com/numpy-discussion@scipy.org/msg14697.html
25. February 2009 at 11:44 am :
25. February 2009 at 11:48 am :
25. February 2009 at 12:49 pm :
I know 2 people using STATA (social science), 2 people using Excel (philosophy and economics), several using LabView (engineers), some using R (statistical science, astronomy), several using S-Lang (astronomy), several using Python (astronomy) and by using Python, I mean that they are using the packages they need, which might be numpy, scipy, matplotlib, mayavi2, pymc, kapteyn, pyfits, pytables and many more. And this is the main advantage of using a real language for data analysis: you can choose among the many solutions the one that fits you best. I also know several people who use IDL and ROOT (astronomy and physics).
I have used IDL, ROOT, PDL, (Excel if you really want to count that in) and Python and I like Python best :-)
@brendano: One other note: I think that you really have to distinguish between data analysis and data visualization. In astronomy this is often handled by completely different software. The key here is to support standardized file storage / exchange formats. In your example the people used scipy, which does not offer a single visualization routine, so you cannot blame scipy for difficulties with 3D plots…
25. February 2009 at 12:58 pm :
There is no denying that if you are after an integrated solution, numpy/scipy is not the best of the ones mentioned today - it may well be the worst (I don’t know them all, but I am very familiar with matlab, and somewhat familiar with R). There is a fundamental problem with all those integrated solutions: once you hit their limitations, you can’t go beyond them. Not being able to handle data which does not fit in memory in matlab, that’s a pretty fundamental issue, for example. Not having basic data structures (hashmap, tree, etc…) is another one. Making advanced UIs in matlab is not easy either.
You can build your own solution with the python stack: the numpy array capabilities are far beyond matlab’s, for example (broadcasting and advanced indexing are much more powerful than matlab’s current capabilities). The C API is complete, and you can do things which are simply not possible with matlab. You want to handle very big datasets? pytables gives you a database-like API on top of hdf5. Things like cython are also very powerful for people who need speed. I believe those are partly consequences of not being integrated.
Concerning the flaws you mentioned (scipy.linalg vs numpy.linalg, etc…): those are mostly legacies, or exist because removing them would be too costly. There are some efforts to remove redundancy, but not all of them will disappear. They are confusing for a newcomer (they were for me), but they are pretty minor IMHO, compared to other problems.
25. February 2009 at 4:45 pm :
Python.
Why? Because given my situation there often are no canned routines. That means sooner or later (usually sooner) I will be programming. Of all the languages and packages I’ve used, Python has no equal. It is object-oriented, has very forgiving run-time behavior, fast turnaround (no edit, compile, debug cycles — just edit and run), great built-in structures, good modularity, and very good libraries. And it’s easy to learn. I want to spend my time getting results, not programming, but I have to go through code development since often nothing like what I want to do exists, and I’ve got to link the numerics to I/O and maybe some interactive things that make it easy to use and run smoothly. I’ve taken on projects that I would not want to attempt in any of the packages/languages I’ve listed.
I agree that Python is not wart-free. The version compatibility can sometimes be frustrating. “One-stop shopping” for a complete Python package is not here, yet (although Enthought is making good progress). It will never be as fast as MATLAB for certain things (JIT compiling, etc. makes MATLAB faster at times). Python plotting is certainly not up to Mathematica standards (although it is good).
However, the Python community is very nice and very responsive. Python now has several easy ways to add extensions written in C or C++ for faster numerics. And for all my desire not to spend time coding, I must admit I find Python programming fun to do. I cannot say that for anything else I’ve used.
25. February 2009 at 9:27 pm :
Can any of these packages compute sparse SVDs like folks have used for Netflix (500K x 25K matrix with 100M partial entries)? Or do regressions with millions of items and hundreds of thousands of coefficients? I typically wind up writing my own code to do this kind of thing in LingPipe, as do lots of other folks (e.g. Langford et al.’s Vowpal Wabbit, Bottou et al.’s SGD, Madigan et al.’s BMR).
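(For what it’s worth, modern scipy releases do ship a truncated sparse SVD via ARPACK, scipy.sparse.linalg.svds; a minimal sketch, with toy-scale stand-ins for the Netflix-sized dimensions:)

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    # Random sparse matrix standing in for a ratings-style dataset.
    rng = np.random.default_rng(0)
    n_rows, n_cols, nnz = 10000, 2000, 100000
    A = sp.csr_matrix(
        (rng.random(nnz),
         (rng.integers(0, n_rows, nnz), rng.integers(0, n_cols, nnz))),
        shape=(n_rows, n_cols))

    U, s, Vt = svds(A, k=20)            # top-20 singular triplets via ARPACK
    print(U.shape, s.shape, Vt.shape)   # (10000, 20) (20,) (20, 2000)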
What’s killing me now is scaling Gibbs samplers. BUGS is even worse than R in terms of scaling, but I can write my own custom samplers that fly in some cases and easily scale. I think we’ll see more packages like Daume’s HBC for this kind of thing.
R itself tends to just wrap the real computing in layers of scripts to massage data and do error checking. The real code is often Fortran, but more typically C. That must be the same for SciPy given how relatively inefficient Python is at numerical computing. It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.
A problem I see with the way R and BUGS work is that they typically try to compile a declarative model (e.g. a regression equation in R’s glm package or a model specification in BUGS), rather than giving you control over the basic functionality (optimization or sampling).
The other thing to consider with these things from a commercial perspective is licensing. R may be open source, but its Gnu license means we can’t really deploy any commercial software on top of it. Sci-Py has a mixed bag of licenses that is also not redistribution friendly. I don’t know what licensing/redistribution looks like for the other packages.
@bill Support and continuity (by which I assume you mean stability of interfaces and functionality) is great in the core R and BUGS. The problem’s in all the user-contributed packages. Even there, the big ones like lmer are quite stable.
25. February 2009 at 9:46 pm :
Accelerated linear algebra routines written by people who know the processors inside and out will result in big wins, obviously. You can also license Intel’s MKL separately and use it to compile NumPy (if I recall correctly, David Cournapeau, who commented above, was largely responsible for this capability, so bravo!). I figure it’s only a matter of time before somebody like Enthought latches onto the idea of selling a Python environment with MKL baked in, so you can get the speedups without the hassle.
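For anyone curious, the MKL hookup has historically been a stanza in numpy’s site.cfg; a hedged sketch, where the paths and library names are placeholders that vary by MKL version and install location:

    # site.cfg fragment for building numpy against MKL (paths are placeholders)
    [mkl]
    library_dirs = /opt/intel/mkl/lib/intel64
    include_dirs = /opt/intel/mkl/include
    mkl_libs = mkl_rt
    lapack_libs =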
26. February 2009 at 9:32 am :
You said “It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.” We are currently working on ways to expose the mathematical functions underlying NumPy to C, so that you can access them in your extension code. During the last Google Summer of Code, the Cython team implemented a friendly interface between Cython and NumPy. This means that you can code your algorithms in Python, but still have the speed benefits of C.
A number of posts above refer to plotting in 3D. I can recommend Enthought’s Mayavi2, which makes interactive data visualisation a pleasure:
http://code.enthought.com/projects/mayavi/
We are always glad for suggestions on how to improve SciPy, so if you do try it out, please join the mailing list and tell us more about your experience.
27. February 2009 at 3:06 am :
I will note that no one defended SAS. Maybe those people don’t read blogs.
27. February 2009 at 3:26 am :
Hmm, I thought I did. I do production work in SAS and mess around (test new stuff, experimental analyses) in R.
Bill
27. February 2009 at 3:35 am :
OK: no one has defended Stata!
4. March 2009 at 2:46 pm :
It’s fast, the graphs are great, and there are virtually no limitations. I’m surprised it wasn’t listed as one of the packages reviewed. We have been using it for years and it is absolutely critical to our business model.
5. March 2009 at 3:38 am :
I’m also a big fan of Stata for more introductory-level stuff as well as for epidemiology-related work. It is developing a programming language that seems useful. One real disadvantage in my book is that it holds only one dataset at a time, along with a limit on data size.
I’ve also used Matlab for a few years. Its statistics toolbox is quite good, and Matlab is pretty fast and has great graphics. It’s limited to some degree in terms of regression modeling, as well as survival methods. Syntactically I find R more intuitive for modeling (though that is the lineage I grew up with). The other major disadvantage of Matlab is distribution of programs, since Matlab is expensive. The same complaint goes for SAS as well :)
5. March 2009 at 2:59 pm :
In large-scale production, SAS is second to none. Of course, large-scale production shops usually have the $$$ to fork over, and SAS’s workflow capabilities (and, to a lesser extent, large-dataset handling capabilities) save enough billable hours to justify the cost. However, for graphics, exploratory data analysis, and analysis beyond the well-established routines, you have to venture into the world of SAS/IML, which is a rather painful place to be. Its PRNGs are also stuck in the last century, top of the line of a class obsolete for anything other than teaching.
R is great for simulation, exploratory data analysis, and graphics. (I disagree with the assertion that R can’t do high-quality graphics, and, like some commenters above, recommend Paul Murrell’s book on the topic.) Its language, while arcane, is powerful enough to write outside-the-box analyses. For example, I was able to quickly write, debug, and validate an unconventional ROC analysis based on a paper I read. As another example, bootstrapping analyses are much easier in R than in SAS.
In short, I keep both SAS and R around, and use both frequently.
I can’t comment too much on Python. MATLAB (or Octave or Scilab) is great for roll-your-own statistical analyses as well, though I can’t see using it for, e.g., a conventional linear models analysis unless I wanted the experience. R’s matrix capabilities are enough for me at this point. I used Mathematica some time ago for some chaos theory and Fourier/wavelet analysis of images and it performed perfectly well. If I could afford to shell out the money for a non-educational license, I would just to have it around for the tasks it does really well, like symbolic manipulation.
I used SPSS a long time ago, and have no interest in trying it again.
5. March 2009 at 6:11 pm :
You can even easily build SPSS Statistics dialog boxes and syntax for R and Python programs. DevCentral has a collection of tools to facilitate this.
This integration is free with SPSS Base.
11. March 2009 at 4:18 am :
To me, the only reason for using SAS is its large-data ability; otherwise, it is a very, very bad program. It, from day one, trains its users to be third-rate programmers.
The learning curve for SAS is actually very steep, particularly for a very logical person. Why? The whole syntax in SAS is pretty illogical and inconsistent.
Sometimes it is ‘/out’; sometimes it is ‘output’.
In 9.2, SAS started to make variables inside a macro local by default.
This is ridiculous!! The SAS company has existed for at least 30 years. How can such a basic programming rule take 30 years to implement?!
Also, if a variable is uninitialized, SAS will still let the code run. At one company I worked for, this simple, stupid SAS design flaw cost our project 3 weeks of delay (there was one uninitialized variable noted among 80k lines of log, all blue). A couple of PhDs on the project who used C and Matlab could not believe SAS would make such a stupid mistake. Yes, to their great disbelief, it did!
My ranking is that Matlab and R are about the same; Matlab is better at plots most times, R is better at manipulating datasets. Stata and SAS are at the same level.
After taking cost into account, the answer is even more obvious.
12. March 2009 at 1:37 pm :
And, really, what competent programmer would ever use a variable without initializing or testing it first? That’s a basic programming rule I learned back in the mid ’60s, after branching off of uninitialized registers, and popping empty stacks.
Bah, you kids. Get off of my lawn!
13. March 2009 at 4:57 am :
We had a demo of Omniture’s Discover OnPremise (formerly Visual Sciences), and the visualization tools are fairly amazing. It seems like an interesting solution for trending real-time evolving data, but we aren’t pulling the trigger on it for now.
13. March 2009 at 9:12 am :
/I3az/
13. March 2009 at 9:14 am :
pdl.perl.org
13. March 2009 at 5:52 pm :
On the other hand, the integration with both numpy and R is quite new, so it’s immature as a stats tool compared to the other packages in this list.
Full transparency: I work for Resolver Systems, so obviously I’m biased towards it :-) Still, we’re very keen on feedback, and we’re happy to give out free copies for non-commercial research and for open source projects.
16. March 2009 at 4:37 pm :
I am primarily a SAS user (over 20 years) who has been using R as needed (a few years) to do things that SAS cannot do (like MARS splines), or cannot do as well (like exploratory data analysis and graphics), or requires expensive SAS products like Enterprise Miner to do (like decision trees, neural networks, etc).
I have worked primarily for financial services (credit card) companies. SAS is the primary statistical analysis tool in these companies partly due to history (S, the precursor to S+ and R, was not yet developed) and partly because it can run on mainframes (another legacy system), accessing huge amounts of data stored on tapes, which I am not sure any other statistical package can. Furthermore, businesses that have the $ will be the last to embrace open-source software like R, as they generally require quick support when they get stuck trying to solve a business problem, and researching the problem in a language like R is generally not an option in a business setting.
Also, SAS’ capabilities for handling large volumes of data are unmatched. I have read huge compressed files of online data (DoubleClick), having over 2 billion records, using SAS, to filter the data and keep only the records I needed. Each of the resulting SAS datasets was anywhere from 35 GB to 60 GB in size. As far as I know, no other statistical tool can process such large volumes of data programmatically. First we had to be able to read in the data and understand it. Sampling the data for modeling purposes came later. I would run the SAS program overnight, and it would generally take anywhere from 6 to 12 hours to complete, depending on the load on the server. In theory, any statistical software that works with records one at a time should be able to process such large volumes of data, and maybe the Python-based tools can do this. I do not know, as I have never used them. But I do know that R, and even tools like WEKA, cannot process such volumes of data. Reading the data from a database using R can mitigate the large-data problems encountered in R (as does using packages like biglm), but SAS is the clear leader in handling large volumes of data.
R, on the other hand, is better suited for academics and research, as cutting-edge methodologies can be and are implemented much more rapidly in R than in SAS, because R’s programming language has more elegant support for vectors and matrices than SAS (proc IML). R’s programming language is much more elegant and logically consistent, while SAS’ programming language(s) are more ad hoc, with non-standard programming constructs. Furthermore, people who prefer R generally have a stronger “theoretical” programming background (most have programmed in C, Perl, or object-oriented languages) or are able to pick up programming faster, while most users who feel comfortable with SAS have less of a programming background and can tolerate many of SAS’ non-standard programming constructs and inconsistencies. These people do not require or need a comprehensive programming language to accomplish their tasks, and it takes much less effort to program in base SAS than in R if one has no “theoretical” programming background. SAS macros take more time to learn, and many programming languages have no equivalent (one exception I know of is C’s pre-processor commands). But languages like R do not need anything like SAS macros: they can achieve the same results in one logically consistent programming language, and do more, like enabling R users to write their own functions. The equivalent in SAS to writing functions in R is to program a new proc in C and know how to integrate it with SAS - an extremely steep learning curve. SAS is more a suite of products, many of them with inconsistent programming constructs (base SAS is totally different from SCL - formerly Screen Control Language, now SAS Component Language), and proc SQL and proc IML are different from data step programming.
So while SAS has a shallow learning curve initially (learn only base SAS), the user can only accomplish tasks of “limited” sophistication with SAS without resorting to proc IML (which is quite ugly). For the business world this is generally adequate. R, on the other hand, has a steeper learning curve initially, but tasks of much greater sophistication can be handled more easily in R than in SAS, once R’s steeper learning curve is behind you.
I foresee an increased use of R relative to SAS over time, as many statistics departments at universities have started teaching R (sometimes replacing SAS with R), and students graduating from these universities will be more conversant with R, or equally conversant with both SAS and R. Many of these students entering the workforce will gravitate towards R, and to the extent the companies they work for do not mandate which statistical software to use, the use of R is bound to increase over time. With memory becoming cheaper and Microsoft-based 64-bit operating systems becoming more prevalent, bigger datasets can be stored in RAM, and R’s limitations in handling large volumes of data are starting to matter less. But the amount of data is also starting to grow, thanks to the internet, scanners (used in grocery chains), etc., and the volume of data may very well grow so rapidly that even cheaper RAM and 64-bit operating systems may not be able to cope with the data deluge. But not every organization works with such large datasets.
For someone who has started their career using SAS, SAS is more than adequate to solve all problems faced in the business world, and there may seem to be no real reason, or even justification, to learn packages like R or other statistical tools. To learn R, I have put in much personal time and effort, and I do like R and foresee using it more frequently over time for exploratory data analysis, and in areas where I want to implement cutting-edge methodologies and am not hampered by large-data issues. Personally, both SAS and R will always be part of my “tool kit” and I will leverage the strengths of both. For those who do not currently use R, it would be wise to start doing so, as R is going to be more widely used over time. The number of R users has already reached critical mass, and since R is free, this is bound to increase the usage of R as the R community grows. Furthermore, the R Help Digest, and the incredibly talented R users who support it, is an invaluable aid to anyone interested in learning R.
20. March 2009 at 3:36 am :
By the way, here’s my (unfair) generalization regarding usage:
– R: academic statisticians
– SAS: statisticians and data-y people in non-academic settings, plus health scientists in academic and non-academic settings
– SPSS: social scientists
– Stata: health scientists
19. April 2009 at 2:03 am :
I am a junior SAS user with only 3 years’ experience. But even I know that you need to press ‘Ctrl’ and ‘F’ to search for ‘uninitialized’ and ‘more than’ in the SAS log to ensure everything is OK.
As far as the couple of C++ PhDs in your group are concerned, they need to learn to play by the rules of whatever system they are using……
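That Ctrl-F ritual is easy to script outside SAS, too; a minimal Python sketch (the phrases beyond ‘uninitialized’ and ‘more than’ are my own additions of other classic log notes):

    # Scan a SAS log for danger phrases that don't stop the run.
    import sys

    DANGER = ("uninitialized", "more than", "repeats of by values",
              "invalid data")

    def scan_log(path):
        hits = []
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                lower = line.lower()
                if any(p in lower for p in DANGER):
                    hits.append((lineno, line.rstrip()))
        return hits

    if __name__ == "__main__":
        for lineno, line in scan_log(sys.argv[1]):
            print("%d: %s" % (lineno, line))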
30. April 2009 at 6:58 pm :
I ran Stata on a Linux server with 16GB of RAM and about 2TB of disk storage. The hardware config was about $12K. I would not recommend using virtual memory for Stata. That said, you can stick a lot of data in 16GB of RAM! If I paid attention to the variable sizes (keeping textual ones out), I got hundreds of millions of rows into memory.
Stata supports scripting (do-files), which is very easy to use, as is the GUI. The GUI is probably the best feature.
The hardware ($12,000) + software ($3,000 for a 2-user license) cost $15,000. The equivalent SAS software was about $100,000. You do the math.
I’ve used SPSS, but that was a while ago. At that time I felt Stata was the superior product.
1. May 2009 at 2:08 am :
> I ran Stata on linux server with 16GB ram and about 2TB of disk storage.
> I would not recommend using virtual memory for Stata.
In my experience, virtual memory is *always* a bad idea. I remember working with ops guys who would consider a server as good as dead once it started using swap.
All programs that effectively use hard disks always have custom code to control when to move data on and off the disk. Disk seeks and reads are just too slow and cumbersome compared to RAM to have the OS try to automatically handle it.
This would be my guess as to why SAS handles on-disk data so well - they put a lot of engineering work into supporting that feature. Same for SQL databases, data warehouses, and inverted text indexes. (Or the widespread popularity of Memcached among web engineers.) R, Matlab, Stata and the rest were originally written for in-memory data and still work pretty much only in that setting.
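The explicit-control pattern is simple enough to sketch; here is a minimal Python version of streaming an arbitrarily large file and aggregating as you go, never holding more than one buffered line in RAM (the tab-separated numeric format is an assumption for illustration):

    from collections import defaultdict

    def column_sums(path, sep="\t"):
        """Sum each column of a huge delimited text file, one line at a time."""
        sums = defaultdict(float)
        with open(path) as f:
            for line in f:                      # buffered read, constant memory
                for i, v in enumerate(line.rstrip("\n").split(sep)):
                    try:
                        sums[i] += float(v)
                    except ValueError:
                        pass                    # skip headers/non-numeric cells
        return dict(sums)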
1. May 2009 at 9:02 pm :
http://www.ats.ucla.edu/stat/technicalreports/
There’s also an interesting reply from Patrick Burns, defending R and comparing it to those 3.
http://www.ats.ucla.edu/stat/technicalreports/Number1/R_relative_statpack.pdf
(Found linked from a comment on John D. Cook’s blog here:
http://www.johndcook.com/blog/2009/05/01/r-the-good-parts/ )
28. May 2009 at 4:54 am :
Below is a summary of the summary — !!! — with my own observations added on.
SAS: Scripting language is awkward, but it’s great for manipulating complex data structures; folks that analyze relational DBs (e.g. govt. folks) tend to use it.
SPSS: Great for the “weekend warriors”; strongly GUI-based; has a scripting language, but it’s inelegant. They charge a license fee for **each** “module” (e.g. correlations? linear regressions? Poisson regressions? A separate fee!). Also, they charge an annual license. Can read Excel files directly. Used to have nicer graphs and charts than Stata (but, see below).
Stata: Elegant, short-’n'-punchy scripting language; CLI- and script-oriented, but also allows GUI. Strong user base, with user-written add-ons available for D/L. **Excellent** tech support! The most recent version (Stata 10) now has some pretty powerful chart/graph editing options (GUI, plus CLI, your choice) that make it competitive with the SPSS graphs. (Minor annoyance: every few versions, they make the data format NOT back-compatible with the previous version — you have to remember to “Save As” last-year’s version, or else what you save at work won’t open at home…)
My background: Took a course on SAS, but haven’t had a reason to use it. I’ve used SPSS and Stata both, on a reasonably regular basis: I currently teach “Intro to Methods” courses with SPSS, but use Stata for my own work. I dislike how SPSS handles missing values. Unlike SPSS, Stata sells a one-time license: once you buy a version, it’s yours to keep until you feel it’s too obsolete to use.
–GG
28. May 2009 at 1:53 pm :
Another way of differentiating between the various statistical software packages is Geek Cred. I usually tell my Intro to Research Methods students (for the social sciences) that…
(On a scale of 0-10…)
R, Matlab, etc. = 9
SAS = 7
Stata = 5
SPSS = 3
Excel = 2
YMMV. :)
COMMENT ON EXCEL: It’s a spreadsheet, first and foremost — so it doesn’t treat rows (cases) as “locked together”, like statistical software does. Thus, when you highlight a column and ask it to sort, it sorts **only** that column. I got burned by this once, back in my first year of grad school, T.A.-ing: sorted HW #1 scores (out of curiosity), and didn’t notice that the rest of the scores had stayed put. Oops.
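The row-integrity point, in code (a toy numpy stand-in for the gradebook):

    import numpy as np

    scores = np.array([[90, 85],     # rows = students, cols = HW1, HW2
                       [70, 95],
                       [80, 75]])

    # Sorting *cases* by HW1: rows move together, pairings preserved.
    by_case = scores[scores[:, 0].argsort()]

    # Sorting one column alone: the silent Excel mistake described above.
    broken = scores.copy()
    broken[:, 0] = np.sort(broken[:, 0])   # HW1 sorted, HW2 stays put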
I now keep my gradebooks in Stata. :)
–GG
29. May 2009 at 1:29 pm :
1) Data size = truly unlimited. I learned to span 6 DASDs (Direct Access Storage Devices = disk drives) on the mainframe when I was processing > 100 million records = quotes and trading activity from all exchanges. When we went to Unix, we used 100 GB of temp “WORK” space and were processing > 1 billion transactions a day in < 1 hour (IBM p630 with 4x 1.45 GHz processors and 32 GB of memory; the processing actually used < 4 GB).
2) Tons and tons of preprogrammed statistical functions with just about every option possible.
3) SAS can read data from almost anything: tapes, disk, etc.; fixed-field flat files, delimited text files (any delimiter, not just comma or tab or space), XML, most any database, and all mainframe data file types. It also translates most any text value into data, and supports custom input and output formats.
SAS is difficult for most real programmers (I took my first programming class in 1977, and have programmed in more languages than I care to share) because it has a data-centric perspective as opposed to a machine/control-centric one. It is meant to simplify the processing of large amounts of data for non-programmers.
SAS used to have incredible documentation and support, at incredibly reasonable prices. Unfortunately, the new generation of programmers and product managers have lost their way, and I agree that SAS has been becoming a beast.
For adhoc work, I immediately fell in love with SAS/EG = Enterprise Guide. Unfortunately, EG is written in .net and is not that well written. I would have preferred it being written in Java so that the interface was more portable and supported a better threading model. Oh well.
One of the better features of SAS is that it is not an interpreted programming language; from the start in 197? it was JIT. Basically, a block of code is read, compiled, and then executed. This is why it is so efficient at processing huge amounts of data. The concept of the “data step” does allow for some built-in inefficiencies from the standpoint of multiple passes through the data, but that is because of SAS’s convenience. A C programmer would have done more things in fewer passes, but the C programmer would have spent many more hours writing the program than SAS’s few minutes to do the same thing. I know this because I’ve done it.
Some place I read a complaint about SAS holding only one observation in memory at a time. That is a gross misunderstanding/mistake. SAS holds one or more blocks of observations (records) in memory at a time. The number held is easily configurable. Each observation can be randomly accessed, whether in memory or not.
SAS 9.2 finally fixes one of the bigger complaints: PROC FCMP allows the creation of custom functions. Originally SAS did not support custom functions; SAS wanted to write them for you.
The most unfortunate thing about SAS currently is that it has such a long legacy on uniprocessor machines that it is having difficulty getting going in the SMP world, being able to properly take advantage of multi-threading and multi-processing. I believe this is due to a lack of proper technical vision and leadership. As such, I believe a Java-language HPC derivative and tools will eventually take over, providing superior ease of use, visualization, portability, and processing speed on today’s servers and clusters. Since most data will come from an RDBMS these days, flat-file input won’t carry enough weight.
But, for my current profession = Capacity Planning for computer systems, you still can’t beat SAS + Excel. On the other hand, it looks like I’m going to have to look into R.
29. May 2009 at 1:47 pm :
To me, SAS is/was the greatest data processing language/system on the planet. But, I still also program in Java, C, ksh, VBScript, Perl, etc. as appropriate. I’d like to see someone do an ARIMA forecast in Excel, or run a regression that does outlier elimination in only 3 lines of code!
11. June 2009 at 1:56 am :
One thing you have to consider is that with SciPy you get all of the Python libraries for free. That includes the Apache Hadoop code, if you choose to use that. And as someone above pointed out, there is now parallel processing built right into the most recent distributions (though I have no personal knowledge of that) for MPI or whatever.
Coming from an engineer in industry (not academia), the really neat thing I like about SciPy is the ease of creating web-based tools (as in, deployed to a web server for others to use) via deployment on an Apache installation and mod_python. If you can get other engineers using your analysis without sending them an Excel spreadsheet, or a .m file (for which they need a Matlab license), etc., it makes your work much more visible.
14. June 2009 at 10:30 am :
I want to know about comparative studies of SAS, R, and SPSS in data analysis.
Can anyone point me to papers related to those?
18. June 2009 at 11:28 am :
it has a very powerful interpreted scripting language which allows one to easily extend stata. there is a very active community and many user written add-ons are available. see: http://ideas.repec.org/s/boc/bocode.html
stata also has a full-fledged matrix programming language (called mata), comparable to matlab, with a c-like syntax, which is compiled and therefore very fast.
managing and preparing data for analysis is a breeze in stata.
finally stata is easy to learn.
obviously not many people use stata around here.
some more biased opinions:
sas is handy if you have some old punch cards in the cupboard or a huge dataset. apart from that it truly sucks. some people say that it is good for managing data, but why not use a good relational database to do that and then use decent statistical software to do the analysis?
excel obviously sucks infinitely more than sas. apart from its (lack of) statistical capabilities and reliability, any point-and-click-only software is an obvious no-no from the point of view of scientific reproducibility.
i don’t care for spss and cannot imagine anyone does.
matlab is nice, but expensive. not so great for preparing/managing data.
have not used scipy/numpy myself, but have colleagues who love it. one big advantage is that it uses python (ie good language to master and use)
r is great, but more difficult to get into. i don’t like the loose syntax too much though. it is also a bitch with big datasets.
17. July 2009 at 10:57 pm :
Project Gemini sneak preview
I doubt this would make Excel the platform of choice for doing anything fancy with large datasets anyways, but I am intrigued.
26. July 2009 at 9:48 pm :
SAS has great support for large files even on a modest machine. A few years ago I ran a bunch of sims for my dissertation using it, and it worked happily away without so much as batting an eyelash on a crappy four-year-old Windoze XP machine with 1.5 GB of memory. Also, procedures like NLP (nonlinear optimization), NLMIXED, MIXED, and GLIMMIX are really great for various mixed-model applications—this is quite broad, as many common models can be cast in the mixed-model framework. NLMIXED in particular lets you write some pretty interesting models that would otherwise require special coding. Documentation in SAS/STAT is really solid and their tech support is great. Graphics suck, and I don’t like the various attempts at a GUI.
I prefer Stata for most “everyday” statistical analysis. Don’t knock that, as it’s pretty common even for a methodologist such as myself to need to fit logistic regression or whatever and not want to have to waste a lot of time on it, which Stata is fantastic for. Stata 11 looks to be even better, as it incorporates procedures such as Multiple Imputation easily. The sheer amount of time spent doing MI followed by logistic regression (or whatever) is irritating. Stata speeds that up. Also when you own Stata you own it all and the upgrade pricing is quite reasonable. Tech support is also solid.
SPSS has a few gems in its otherwise incomprehensible mass of utter bilge. IMO it’s a company with highly predatory licensing, too.
R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility. I like it for class because it’s free, there are nice books for it, and it lets me bypass IT as it’s possible to put a working R system on a USB drive. I love the graphics.
Matlab has made real strides as a programming language and has superb numerics in it (or did), at least according to the numerics people I know (including my numerical analysis professor). However, Statistics Toolbox is iffy in terms of what procedures it supports, though it might have been updated. Graphics are also nice. But it is expensive.
Mathematica is nice for symbolic calculation. With the MathStatica add-on (sadly this has been delayed for an unconscionable amount of time) it’s possible to do quite sophisticated theoretical computations. It’s not a replacement for your theoretical knowledge, but it is very helpful for doing all the tedious, error-prone calculations involved.
27. July 2009 at 10:58 am :
Matlab is good for linear algebra and related multivariate stats. I could never get any nice plotting out of it. It can do plenty of things I never learnt about, but I can’t afford to buy it, so I can’t use it now anyway.
R is powerful, but can be very awkward. It can write jpeg, png, and pdf files, make 3D plots and nice 2D plots as well. Two things put me off it: it’s an absolute dog to debug (how does “duplicate row names are not allowed” help as an entire error message when I’ve got 1000 lines of code spread between 4 functions?), and its data types have weird eccentricities that make programming difficult (like transposing a data frame turns it into a matrix, and using sapply to loop over something returns a data frame of factors… I hate factors). There are a lot of packages that can do some really nice things, although some have pretty thin documentation (that’s open source for you).
Octave is nicer to use than R ( = Matlab is nicer to use than R), but I found it lacking in most things I wanted to do, and the development team seem to wait for something to come out in Matlab before they’ll do it themselves, so they’re always one step behind someone else.
I’m surprised how quickly I’m picking up SciPy. It’s much easier to write, read and debug than R, and the code looks nicer. I haven’t done much plotting yet, but it looks promising. The only trick with Python is its assignments for mutable data types, which I’m still getting my head around.
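For anyone bitten by the same thing, the mutable-assignment behavior in a nutshell, plus the closely related numpy-view case:

    import numpy as np

    a = [1, 2, 3]
    b = a             # b is another name for the *same* list
    b.append(4)
    print(a)          # [1, 2, 3, 4] -- a changed too

    c = a[:]          # a shallow copy is a new list
    c.append(5)
    print(a)          # [1, 2, 3, 4] -- unaffected

    x = np.zeros(3)
    y = x[1:]         # numpy slices are views onto the same buffer
    y[:] = 7
    print(x)          # [0. 7. 7.]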
29. July 2009 at 9:45 pm :
http://reference.wolfram.com/mathematica/note/SomeNotesOnInternalImplementation.html#28959
(I work for Wolfram Research)
30. July 2009 at 11:43 pm :
Everyone really likes Stata. Interesting.
19. August 2009 at 6:17 pm :
For instance, here’s an example of taking some mutual fund data and visualizing those mutual funds (from 3 different categories) in a Fisher Linear Discriminant transformed space (down to 3 dimensions from an initial 57 or so):
http://yaroslavvb.com/upload/strands/dim-reduce/dim-reduce.html
11. September 2009 at 6:38 am :
I work in ecology and I use Mathematica almost exclusively for modeling. I’ve found that the elegance of the programming language lends itself to easily using it for statistical analysis as well. Although it isn’t really a statistics package, being able to generate large amounts of data and then process them in the same place is extremely useful. To make up for the lack of built-in statistical analysis I’ve built my own package over time by collecting and refining the tests I’ve used.
For most people I would say using Mathematica for statistics is way more work than it is worth. Nevertheless, those who already use it for other things may find it is more than capable of performing almost any data analysis you can come up with using relatively little code. The addition of functionality targeted at statistics in versions 6 and 7 has made this use simpler, although the built-in ANOVA package is still awkward and poorly documented. One thing it and Matlab beat other packages at hands down is list/matrix manipulation, which can be extremely useful.
14. September 2009 at 9:10 pm :
Thank you.
Paul
25. September 2009 at 1:39 pm :
maximize: amount of useful output
subject to: (salaries of staff * hours worked) + cost of software < budget
It turns out the IMF achieves that by letting every economist work with whatever they want. As a matter of fact, economists end up using Stata.
Consider that most economics datasets are smaller than 1Gb. Stata MultiProcessor will work comfortably with up to 4Gb on the available machines. Stata has everything you need for econometrics, including a matrix language that is just like Matlab and state of the art maximum likelihood optimization, so you can create your own “odd” statistical estimators. Programming has a steeper learning curve than Matlab but once you know the language it’s much more powerful, including very nice text data support and I/O (not quite python, but good enough). If you don’t need some of the fancy add-on packages that engineers use, like say “hydrodynamics simulation”, that’s all you need. But most importantly importing, massaging and cleaning data with Stata is so unbelievably efficient that every time I have to use another program I feel like I am walking knee-deep in mud.
So why do I have to use other programs, and which?
The IMF has one copy of SAS that we use for big jobs, such as when I had 100Gb of data. I won’t dwell on this because it’s been covered above, but in general SAS is industrial-grade stuff. One big difference between SAS and other programs is that SAS will try to keep working when something goes wrong. If you *need* numbers by the next morning, you go to bed; the next morning you come in and Stata has stopped working because of a mistake. SAS hasn’t, and perhaps your numbers are garbage, but if you are able to tell that they are simply 0.00001% off, then you are in perfectly good shape to make a decision.
Occasionally I use Matlab or Gauss (yes, Gauss!) because I need to put the data through some black box written in that language and it would take too long to understand it and rewrite it.
That’s all folks. Thanks for the attention.
30. September 2009 at 5:56 pm :
You will see ROOT is definitely better than R.
2. January 2010 at 7:10 pm :
> the majority of ‘R’ers on this thread act like a bunch of rebellious teens …
Well spotted — I’ve been a rebellious teen for decades now.
10. January 2010 at 10:35 am :
But!!! I keep using it and keep discovering new ways of using it. Now, I use the ‘dmsend’ function from the ‘twitteR’ package to send me the status of my time-consuming simulations while I am out of the office. It is just awesome that using R makes me feel bounded by nothing.
BTW, does anyone know how to use R to send emails (on various OSes: Windows, Mac, Unix, Linux)? I googled a bit and found nothing very promising. Any plans to develop a package?
If we had the package, we could just hit ‘paste to console’ (RWinEdt) or C-c C-c (ESS+Emacs) and let R estimate, simulate, and send the results to co-authors automatically. What a beautiful world!!
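(Until such a package turns up, the notify-by-mail trick is a few lines in any general-purpose language; a sketch using Python’s standard smtplib, where the server, account, and addresses are placeholders:)

    import smtplib
    from email.mime.text import MIMEText

    def notify(subject, body, host="smtp.example.com", port=587,
               user="me@example.com", password="app-password",
               to="coauthor@example.com"):
        """Send a short status mail, e.g. when a long simulation finishes."""
        msg = MIMEText(body)
        msg["Subject"], msg["From"], msg["To"] = subject, user, to
        with smtplib.SMTP(host, port) as s:
            s.starttls()
            s.login(user, password)
            s.send_message(msg)

    # notify("simulation finished", "all 10,000 replications done")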
I use Matlab and STATA as well, but R completely owns me. Being a bad boy naturally, I have started encouraging newcomers to use R in my workplace.
13. January 2010 at 8:30 pm :
I’ve been using SPSS for over 30 years, and I’ve appreciated the steep increase in usability from punch-card syntax to pull-down menus. I only ran into R today because it can handle zero-inflated Poisson regression and SPSS can’t or won’t.
I think it is great to find open-source statistical software. I guess it requires a special mental framework to actually enjoy struggling through the command structure, but if I were 25 years younger………
It really is a bugger to find that SPSS (or whatever they like to be called) and R come up with different parameter estimates on the same dataset [at least in the negative binomial model I compared].
Is there anyone out there with experience in comparing two or more of these packages on one and the same dataset?
16. January 2010 at 9:58 am :
Why don’t you join the R mailing list? If you ask questions properly there, you will get answers.
I would suggest a place to start: http://www.r-project.org/mail.html
Have fun.
27. January 2010 at 10:22 am :
I am new to R. I would like to know about R-PLUS. Does anyone know where I can get free training for R-PLUS?
Regards,
Peng.
12. February 2010 at 8:38 pm :
I’ve looked at Matlab, but the primitive nature of its language turns my stomach. (I mean, here’s a language that uses alternating strings and values to imitate named parameters? A language where it’s not unusual to have half a page of code in a routine dedicated to filling in parameters based on the number of supplied arguments.) And the Matlab culture seems to favor Perlesque obfuscation of code as a value. Plus it’s expensive. It’s really an engineer’s tool, not a statistician’s tool.
SAS creeps me out: it was obviously designed for punched cards and it’s an inconsistent mix of 1950’s and 1960’s languages and batch command systems. I’m sure it’s powerful, and from what I’ve read the other statistics packages actually bend their results to match SAS’s, even when SAS’s results are arguably not good. So it’s the Gold Standard of Statistics ™, literally, but it’s not flexible and won’t be comfortable for someone expecting a well-designed language.
R’s language has a good design that has aged well. But it’s definitely open source: you have two graphics languages that come in the box (base and lattice), with a third that’s a real contender (ggplot2). Which to choose? There are over 2,000 packages, and it takes a bit of analysis just to decide which of the four wavelet packages you want to use for your project — not just current features, but how well maintained the package appears to be, etc.
There are really three questions to answer here: 1) What field are you working in, 2) How focused are your needs, and 3) What’s your budget?
In engineering (and machine learning and computer vision), 95% of the example code you find in articles, online, and in repositories will be Matlab. I’ve done two graduate classes using R where Matlab was the “no brainer” choice, but I just can’t stomach Matlab “programming”. Python might’ve been a good choice as well, but with R I got an incredible range of graphics combined with a huge variety of statistical and learning techniques. You can get some of that in Python, but it’s really more of a general-purpose tool where you often have to roll your own.
27. February 2010 at 12:05 am :
SAS seems to excel at data handling, both with large datasets and with whacked proprietary formats (how else can you read a 60GB text file and merge it with an Access database from 1998?). It is really ugly though, not interactive/exploratory, and the graphics aren’t great.
R is awesome because it is a fully featured language (things like named parameters, object orientation, typing, etc.), and because every new data-analysis algorithm probably gets implemented in it first these days. I rather like the graphics. However, it is a mess, with naming conventions that have evolved badly over time, conflicting types, etc.
Matlab is awesome in its niche, which is NOT data analysis, but rather math modeling with scripts between 10 and 1000 lines. It is really easy to get up and running if you have a math (i.e. linear algebra) background, the function-file system is great for a medium level of software engineering, plotting is awesome and simpler than R, and the datatypes (structs) are complex enough but don’t involve the headaches of a “well developed” type system. If you are doing data management, GUI interaction, or dealing with categorical data, it might be best to use SQL/SAS or something else and export your data into matrices of numbers.
I would like numpy and friends, but ZERO-BASED INDEXING IS NOT MATHEMATICAL.
Just my 2c
16. April 2010 at 4:52 pm :
After working as an econometrics analyst for a while, mainly using Stata, I can say the following about Stata:
Stata is relatively easy to get started with and to produce some graphics quickly (that’s what all the business people want: click click, here’s your PowerPoint presentation with lots of colourful graphics and no real content).
BUT if you want to automate things, and if you want to make Stata do things it isn’t capable of out of the box, it is pure pain!
The big problem is: on one hand, Stata has a scripting/command interface, which is not very powerful and very, very inconsistent. On the other hand, Stata has a fully featured matrix-oriented programming language with a c-like syntax, which, being c-like, is not very handy (c is old and not made for mathematics; the Matlab language is much more convenient), and which doesn’t work well with the rest of Stata (you have a superfluous extra level for interchanging data from one part to the other).
Altogether, programming Stata feels like persuading Stata:
error messages are almost useless, the macro text expansion used in the scripting language is not very suitable for things that have to do with mathematics (text can’t calculate), and many other little things.
It is very inconsistent, sometimes very clumsy to handle, and has silly limitations, like string expressions limited to 254 chars as in the early 20th century.
So go with Stata for a little ad hoc statistics, but do not use it for more sophisticated stuff; in that case learn R!
19. April 2010 at 11:13 pm :
I did have to spend a lot of time learning how to think in Mathematica - it’s most powerful when used as a functional language, and I was a procedural programmer. However, if you want to use a procedural programming approach, Mathematica supports that.
Regarding some of the other topics discussed above: (1) Mathematica has built-in support for parallel computing, and can be run on supercomputing clusters (Wolfram Alpha is written in Mathematica). (2) The language is highly evolved and is being actively extended and improved every year. It seems to be in an exponential phase of development currently - Stephen Wolfram outlines the development plans every year at the annual user conferences - and his expectations seem to be pretty much on target. (3) Wolfram has a stated goal of making Mathematica a universal computing platform which smoothly integrates theoretical and applied mathematics with general-purpose programming, graphics, and computation. I admit to a major case of hero worship, but I think he is achieving this goal.
I’m going on and on about Mathematica because, in spite of its wonderfulness, it doesn’t seem to have taken its rightful place in these discussions. Maybe Mathematica users drop out of the “what’s the best language for x” debate after they start using it. I don’t know, really. But anyway, that’s the way I see it.
27. April 2010 at 4:26 am :
Originals at discount prices, lots of technical books. Please visit
http://bupka.wordpress.com
The MATLAB book discussed above is currently in stock.
Please have a look at the others as well.
27. April 2010 at 9:37 am :
Also, SAGE ( http://www.sagemath.org ), the open-source alternative to Mathematica, has gotten quite powerful in the last few years.
8. May 2010 at 6:16 am :
Also, I’m starting to use Matlab now and loving how intuitive it is (for someone with programming experience, anyway).
9. May 2010 at 5:40 pm :
“I’m one of the two originators of R. After reading Jan’s paper I wrote to him and said I thought it was interesting that he was choosing to jump from Lisp to R at the same time I was jumping from R to Common Lisp……
We started work on R in the early ’90s. At the time decent Lisp implementations required much more resources than our target machines had. We therefore wrote a small scheme-like interpreter and implemented over that. Being rank amateurs we didn’t do a great job of the implementation and the semantics of the S language which we borrowed also don’t lead to efficiency (there is a lot of copying of big objects).
R is now being applied to much bigger problems than we ever anticipated and efficiency is a real issue. What we’re looking at now is implementing a thin syntax over Common Lisp. The reason for this is that while Lisp is great for programming it is not good for carrying out interactive data analysis. That requires a mindset better expressed by standard math notation. We do plan to make the syntax thin enough that it is possible to still work at the Lisp level. (I believe that the use of Lisp syntax was partially responsible for why XLispStat failed to gain a large user community).
The payoff (we hope) will be much greater flexibility and a big boost in performance (we are working with SBCL so we gain from compilation). For some simple calculations we are seeing orders of magnitude increases in performance over R, and quite big gains over Python…..”
the full post is here:
http://r.789695.n4.nabble.com/Ross-Ihaka-s-reflections-on-Common-Lisp-and-R-td920197.html#a920197
It is quite interesting to note that such a “provocative” post from one of R’s originators got zero response from the R-dev list…
16. June 2010 at 4:12 pm :
I’m trying to decide which software package to use. I’m a researcher working with clinical (patient-related) data. I have data sets with <10,000 rows (usually just a few thousand). I need software that will generate multivariate and logistic regression, and Kaplan-Meier survival curves. Visualization is very important.
Of note, I’m an avid programmer as a hobby (C++, assembly, most anything), so I’m very comfortable with a more complex package, but I need something that just works. I’ve been using SPSS, which works but is clunky.
Any suggestions? Stata? Systat? S-Plus? Maple?
16. June 2010 at 5:13 pm :
R might be worth trying too.
2. August 2010 at 2:44 am :
After you peel back all the layers and look for the solution that requires the least effort, offers the most power, and has the greatest flexibility, why would anyone choose anything other than RPy first, with whatever language your employer uses second as a backup, and scrap the code war?
I mean, for my money, you make sure you can build a model in Excel, learn RPy & C#, and search for APIs if you need to use other languages, or just plain partner with someone who can code C++ {if you can’t} and simply inject it.
I mean, I plan on learning Java, PHP, and SAS as well, but that is really a personal choice. Coming from IT within Finance, not knowing Java and SAS means you either won’t get in the door or will reach a glass ceiling pretty quickly unless you play corporate politics really, really well. So for me, it is a necessity. But the flip side is, wanting to make the leap into Financial Engineering after completing a doctorate in Engineering, RPy has also become a near-necessity. Realistically, unless you just like coding, I have to say that what I have suggested makes the most sense for the average analysis pro. But then a lot of this is based upon whether you’re a Quant Researcher, Quant Developer, Analyst, etc. — different tools for different functions.
Just a thought.
14. August 2010 at 11:06 pm :
1. there is a book out on the topic (http://www.amazon.com/gp/product/1420070576?ie=UTF8&tag=sasandrblog-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1420070576)
2. R interface available in SAS 9.2
“While SAS is committed to providing the new statistical methodologies that the marketplace demands and will deliver new work more quickly with a recent decoupling of the analytical product releases from Base SAS, a commercial software vendor can only put out new work so fast. And never as fast as a professor and a grad student writing an academic implementation of brand-new methodology.
Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers.
“We know a lot of our users have both R and SAS in their toolkit, and we decided to make it easier for them to access R by making it available in the SAS 9.2 environment,” said Rodriguez.
The SAS/IML Studio interface allows you to integrate R functionality with SAS/IML or SAS programs. You can also exchange data between SAS and R as data sets or matrices.
“This is just the first step,” said Radhika Kulkarni, Vice President of Advanced Analytics. “We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. In the future, users will be able to interface with R through the IML procedure.”
http://support.sas.com/rnd/app/studio/Rinterface2.html
While this is probably more for SAS users than R, I thought both camps might be interested in case you get coerced into using SAS one day… doesn’t mean you have to give up your experience with R.
26. August 2010 at 4:51 pm :
- full support of R
- fully scriptable, which means you can call DLLs written in whatever programming language possible and implementing things which you didn’t find inbuilt in Statistica (which doesn’t mean it’s not there)
- the Statistica solver / engine can be called externally from Excel and other applications via the COM/OLE interface
- untrammelled graphics of virtually any complexity — extremely flexible and customizable (and scriptable)
- the Data Miner (with its brand new ‘Data Miner Recipes’) is another extremely powerful tool that leaves only your imagination to limit you
….it would be tedious to list all its advantages (again, the Statistica Neural Networks and the Six Sigma modules are IMO very professionally implemented).
17. October 2010 at 4:40 am :
On IBM mainframes the choice of languages is limited, and by default the choice will usually be SAS. Most large organisations have SAS, at least Base SAS, installed by default because the Merrill MXG capacity-planning software uses it. Hence cost is sort of irrelevant. It then tends to be used for anything requiring processing of text files, even in production applications, and this often means processing text as text (e.g. JCL with date-dependent parameters) rather than preparing data for loading into SAS datasets for statistical analysis.
I know nothing about R, but seeing a few code samples it struck me how much it resembled APL, to which we were introduced in our stats course in college in the early ’70s - not surprising, as both are matrix-oriented.