pyspark-dataframes

python - "No module named pyspark" error

This is the exact code from the tutorial I am following. My classmate runs the same code without getting this error:

    ImportError Traceback (most recent call last) in ()
    ----> 1 from pyspark import SparkContext
          2 sc = SparkContext('local', 'Exam_3')
          3
          4 from pyspark.sql import SQLContext
          5 sqlContext = SQLContext(sc)

    ImportError: No module named pyspark

This is the code:

    from pyspark import SparkContext
    sc = SparkContext('local ...
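One frequent reason the same code works on a classmate's machine but not here is that the interpreter running the notebook cannot locate a pyspark installation. The excerpt does not confirm that this is the cause, so the following is only a diagnostic sketch. It assumes Python 3.4 or later, and the helper name is_importable is made up for illustration.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the current interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Print the interpreter path, since a notebook kernel may not be the
    # same Python that pyspark was installed into.
    print("interpreter:", sys.executable)
    print("pyspark importable:", is_importable("pyspark"))
```

If the check reports False, comparing the printed interpreter path with the one used on a working machine is a reasonable next step.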

python - Combine PySpark DataFrame ArrayType fields into a single ArrayType field

I have a PySpark DataFrame with 2 ArrayType fields:

    >>> df
    DataFrame[id: string, tokens: array, bigrams: array]
    >>> df.take(1)
    [Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['onetwo', 'twotwo'])]

I want to combine them into a single ArrayType field:

    >>> df2
    DataFrame[id: string, tokens_bigrams: array]
    >>> df2.take(1)
    [Row(id='ID1', tokens_bigrams=['one', 'two' ...

ArrayType DataFrame tokens two python apache-spark pyspark apache-spark-sql
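The expected output in the question is cut off, so the sketch below assumes the goal is plain concatenation of tokens followed by bigrams. It uses a UDF, which works across Spark versions; the function names combine_arrays and add_tokens_bigrams are invented for this example and do not come from the original thread.

```python
def combine_arrays(tokens, bigrams):
    """Concatenate two lists, treating a missing (None) array as empty."""
    return list(tokens or []) + list(bigrams or [])


def add_tokens_bigrams(df):
    """Return a DataFrame with id and a combined tokens_bigrams column.

    pyspark is imported inside the function so that the pure-Python helper
    above can be used and tested without a Spark installation.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    # Assumption: both input columns are arrays of strings.
    combine_udf = udf(combine_arrays, ArrayType(StringType()))
    return df.select(
        "id", combine_udf(df["tokens"], df["bigrams"]).alias("tokens_bigrams")
    )
```

On Spark 2.4 and later, pyspark.sql.functions.concat also accepts array columns, which avoids the Python UDF overhead.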

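python - Compute the autocorrelation along each column of a Pandas DataFrame

I want to compute the lag-one autocorrelation coefficient for the columns of a Pandas DataFrame. A snippet of my data is:

                RF        PC         C         D        PN        DN         P
    year
    1890       NaN       NaN       NaN       NaN       NaN       NaN       NaN
    1891 -0.028470 -0.052632  0.042254  0.081818 -0.045541  0.047619 -0.016974
    1892 -0.249084  0.000000  0.027027  0.067227  0.099404  0.045455  0.122337
    1893  0.653659  0.000000  0.000000  0.039370 -0.135624  0.043478 -0.142062

Along the year axis, I want to compute the lag-one autocorrelation of each column (RF, PC, etc.) ...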

DataFrame python nan pandas numpy
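One way to get a lag-one autocorrelation per column is to apply Series.autocorr to every column. The sketch below uses small made-up data (the column names up and flip are not from the question) with a leading NaN row, as in the question's snippet. It is an illustration under those assumptions, not the thread's accepted answer.

```python
import numpy as np
import pandas as pd

# Made-up data: one steadily increasing column and one alternating column.
df = pd.DataFrame(
    {
        "up": [np.nan, 1.0, 2.0, 3.0, 4.0, 5.0],
        "flip": [np.nan, 1.0, -1.0, 1.0, -1.0, 1.0],
    },
    index=pd.Index([1890, 1891, 1892, 1893, 1894, 1895], name="year"),
)

# Series.autocorr correlates each column with itself shifted by `lag` rows;
# pairs containing NaN are left out of the calculation.
lag_one = df.apply(lambda column: column.autocorr(lag=1))
print(lag_one)
```

The result is a Series indexed by column name, so individual coefficients can be read with lag_one["up"] and so on.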

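python - PySpark: create a new column using a dictionary mapping

Using Spark 1.6, I have a Spark DataFrame column (named col1) with the values A, B, C, DS, DNS, E, F, G and H. I want to create a new column (say col2) with the values from the dict below. How do I map this? (For example, 'A' needs to be mapped to 'S', and so on.)

    dict = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

Best answer

An inefficient solution with a UDF (version independent):

    from pyspark.sql.types import StringType
    from py ...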

PySpark python mapping apache-spark dictionary apache-spark-sql
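The answer's code is cut off in this excerpt, so the following is not a reproduction of it. It is a minimal sketch in the same spirit: a UDF that looks each value up in the dictionary. The names MAPPING, map_value and add_mapped_column are chosen here for illustration only.

```python
# The mapping from the question, stored under a name that does not shadow
# the built-in dict type.
MAPPING = {
    'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
    'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS',
}


def map_value(value):
    """Look up one value; unknown keys map to None (null in Spark)."""
    return MAPPING.get(value)


def add_mapped_column(df, source="col1", target="col2"):
    """Return df with a new column holding the mapped values.

    pyspark is imported here so the lookup logic above can be exercised
    without a Spark installation.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    translate = udf(map_value, StringType())
    return df.withColumn(target, translate(df[source]))
```

A Python UDF sends every row through the Python worker, which is why the answer calls this route inefficient.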

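python - Count the 100 most common words from sentences in a Pandas DataFrame

I have text reviews in one column of a Pandas DataFrame, and I want to count the N most frequent words together with their frequency counts (across the whole column, not within a single cell). One approach is to count the words with a Counter by iterating over each row. Is there a better alternative? Representative data:

    0    a heartening tale of small victories and endu
    1    no sophomore slump for director sam mendes w
    2    if you are an actor who can relate to the sea
    3    it's this memory-as-identity obviation that g
    4    boyd's screenplay ( co-written with guardian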

words python pandas
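As a point of comparison for the question above, here is a sketch that counts words across the whole column in a single pass instead of keeping a per-row loop with manual updates. It assumes words are separated by whitespace and that case should be ignored. The function name top_words and the sample reviews are made up for illustration.

```python
from collections import Counter
from itertools import chain


def top_words(texts, n=100):
    """Return the n most common whitespace-separated words across all texts."""
    # Flatten every text into one stream of lowercase words, then count once.
    words = chain.from_iterable(str(text).lower().split() for text in texts)
    return Counter(words).most_common(n)


reviews = ["a small tale of small victories", "a tale of a director"]
print(top_words(reviews, n=3))
```

Any iterable of strings works, so a DataFrame column can be passed directly. A pandas-only alternative is to split the column with str.split(expand=True), stack the result, and call value_counts(), at the cost of building a wide intermediate frame.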

python - Concat DataFrame: Reindexing only valid with uniquely valued Index objects

I am trying to concatenate the following DataFrames:

    df1
                                    price  side         timestamp
    timestamp
    2016-01-04 00:01:15.631331072  0.7286     2  1451865675631331
    2016-01-04 00:01:15.631399936  0.7286     2  1451865675631400
    2016-01-04 00:01:15.631860992  0.7286     2  1451865675631861
    2016-01-04 00:01:15.631866112  0.7286     2  1451865675631866

and:

    df2
    bid  bid_size  offer  offer_size
    timestamp
    2016-01-0 ...

Reindexing DataFrame index self site-packages python numpy pandas

python - pandas DataFrame.to_sql() function: if_exists parameter not working

When I try to pass the if_exists='replace' parameter to to_sql, I get a programming error telling me that the table already exists:

    >>> foobar.to_sql('foobar', engine, if_exists=u'replace')
    ...
    ProgrammingError: (ProgrammingError) ('42S01', "[42S01] [Microsoft][ODBC SQL Server Driver][SQL Server]There is already an object named 'foobar' in the database. (2714) (SQLExecDirectW)") u'\nCREATE TABLE foo ...

DataFrame if_exists self site-packages Enthought python sql pandas

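python - Adding data to a Pandas DataFrame loaded from a CSV file causes a ValueError

I am trying to add an int to an existing value in a Pandas DataFrame:

    >>> df.ix['index5', 'TotalDollars'] += 10

I get the error: ValueError: Must have equal len keys and value when setting with an iterable. I think the error comes from the datatype, as obtained from:

    >>> print type(df.ix['index5', 'TotalDollars'])
    int64

The DataFrame is populated from a CSV file. I tried loading the database from another CSV file:

    >>> print type(df.ix['index5', 'TotalDollars'])
    int64

What causes this difference in type? Best answer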

Dataframe python csv numpy pandas

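python - Storing a pandas DataFrame in a SQLAlchemy model

I am building a Flask application that lets users upload CSV files (with varying columns), preview the uploaded file, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded files are read into pandas DataFrames, which lets me handle most of the complicated data work elegantly. I would like these DataFrames, along with the associated metadata (upload time, ID of the uploading user, and so on), to persist and be available to multiple users for passing to various views. However, I am not sure how best to incorporate the data into my SQLAlchemy models (I am using PostgreSQL on the backend). The three approaches I have considered: stuffing the DataFrame into a PickleType and storing it directly in the database ...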

SQLAlchemy DataFrame python pandas flask

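python - Creating a large dictionary in pyspark

I am trying to solve the following problem using pyspark. I have a file on hdfs that is a dump of a lookup table, in this format:

    key1,value1
    key2,value2
    ...

I want to load it into a Python dictionary in pyspark and use it for other purposes. So I tried this:

    table = {}

    def populateDict(line):
        (k, v) = line.split(",", 1)
        table[k] = v

    kvfile = sc.textFile("pathtofile")
    kvfile.foreach(populateDict)

I found that the table variable is not modified. So, is there a way to create a large in-memory hash table in spark? Best answer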

python apache-spark
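The answer to the question above is missing from this excerpt. The usual explanation for the behaviour described is that foreach runs on the executors, each of which works on its own copy of table, so the driver's dictionary never changes. A minimal sketch of one alternative follows, assuming the lookup table fits in driver memory. The names parse_line and load_lookup_table are invented for illustration, and sc is assumed to be an existing SparkContext.

```python
def parse_line(line):
    """Split 'key,value' on the first comma only, so values may contain commas."""
    key, value = line.split(",", 1)
    return key, value


def load_lookup_table(sc, path):
    """Build the dictionary on the driver from an RDD of (key, value) pairs.

    collectAsMap() brings every pair back to the driver, so this only suits
    tables small enough to fit in driver memory.
    """
    return sc.textFile(path).map(parse_line).collectAsMap()
```

If the executors need the table too, the resulting dictionary can be shared with them through sc.broadcast rather than captured in a closure.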
