pyspark: Save schemaRDD as json file
I am looking for a way to export data from Apache Spark to various other tools in JSON format. I presume there must be a fairly straightforward way to do it.
Example: I have the following JSON file 'jfile.json':
{"key": value_a1, "Key2": value_b1}, {"key": Value_a2, "key2": value_b2}, {...}
Where each line in the file is a JSON object
Files like this can easily be read into PySpark with
jsonRDD = jsonFile('jfile.json')
and the rows, once collected, look like [Row(key=value_a1, key2=value_b1), Row(key=value_a2, key2=value_b2)]. Now I want to save this kind of data back to a pure JSON file.
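For completeness, the read step in full is roughly the following on my side (Spark 1.x, where jsonFile lives on a SQLContext and returns a SchemaRDD):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="json-export")
    sqlContext = SQLContext(sc)

    # Each line of jfile.json becomes one Row in the SchemaRDD.
    jsonRDD = sqlContext.jsonFile('jfile.json')

    # Collecting shows the parsed rows, e.g.
    # [Row(key=value_a1, key2=value_b1), Row(key=value_a2, key2=value_b2)]
    print(jsonRDD.collect())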
I found this entry in the Spark User List:
After doing this, the resulting text file contains
Row(key=value_a1, key2=value_b1)
Row(key=value_a2, key2=value_b2)
i.e. the schemaRDD's Rows are simply written to the file as their string representation. After reading the Spark user list entry I had expected some kind of "automagic" conversion back into JSON format.
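For reference, the attempt boiled down to roughly this (a sketch, since the exact suggestion from the list entry is not quoted above):

    # Saving the SchemaRDD directly just writes str(row) for every Row:
    jsonRDD.saveAsTextFile('out_dir')
    # out_dir/part-00000 then contains lines like
    # Row(key=value_a1, key2=value_b1)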
Am I missing a really easy way to do this?
I searched Google, the user list, and Stack Overflow for an answer, but almost all the answers deal with reading and parsing JSON into Spark. I even bought the book 'Learning Spark', but the examples there (p. 71) also just write to a plain output file, like the one above.
Can anybody help me out here? I think I am just missing a small link.
Cheers and thanks in advance!
I can't see an easy way to do it. One solution is to convert every element of the SchemaRDD to a String, ending up with an RDD[String] where each element is the formatted JSON for that row. So you need to write your own JSON serializer. That's the easy part. It may not be super fast, but it should work in parallel, and you already know how to save an RDD to a text file.
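As a rough skeleton in PySpark (my sketch, and serialize_row is a hypothetical placeholder for the serializer you would still have to write):

    schema = jsonRDD.schema()      # grab the schema once, on the driver

    # Turn every Row into a JSON string, then write the strings out.
    jsonStrings = jsonRDD.map(lambda row: serialize_row(row, schema))
    jsonStrings.saveAsTextFile('jfile_out.json')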
The key insight is that you can get a representation of the schema out of the SchemaRDD by calling its schema method. Then, for each Row handed to you in the map, you have to traverse it recursively in tandem with that schema. For flat JSON this is just a tandem list traversal, but you may also need to handle nested JSON.
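For example (a sketch against the pre-1.3 SchemaRDD API), the schema obtained above is just a tree of fields you can walk:

    # schema (from jsonRDD.schema()) is a pyspark.sql.StructType:
    # essentially a list of StructFields, each with a name and a dataType,
    # which is what you traverse in tandem with the Row's values.
    for field in schema.fields:
        print(field.name, field.dataType)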
The rest is just a small matter of Python, which I don't speak, but I have this working in Scala, in case that helps. The parts where the Scala code gets dense don't actually depend on deep Spark knowledge; if you can follow the basic recursion and understand Scala, you should be able to work it out. The bulk of the work for you is figuring out how to work with a pyspark.sql.Row and a pyspark.sql.StructType in the Python API.
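Since I can't offer the Python myself, here is only a rough sketch of what the recursive tandem traversal could look like there. It is an illustration rather than a translation of my code: it assumes the pre-1.3 layout where StructType and ArrayType live directly in pyspark.sql, and it handles flat fields, nested objects and arrays, but not map types:

    import json
    from pyspark.sql import StructType, ArrayType

    def format_item(value, data_type):
        # Format one value according to its schema type.
        if value is None:
            return None                        # missing value; rendered as JSON null
        if isinstance(data_type, StructType):
            return format_row(value, data_type)    # nested JSON object
        if isinstance(data_type, ArrayType):
            return [format_item(v, data_type.elementType) for v in value]  # JSON array
        return value                           # primitive; json.dumps handles it

    def format_row(row, schema):
        # Walk the Row and its StructType in tandem to build a plain dict.
        return dict(
            (field.name, format_item(value, field.dataType))
            for field, value in zip(schema.fields, row)
        )

    # Plug this into the skeleton above:
    schema = jsonRDD.schema()
    jsonStrings = jsonRDD.map(lambda row: json.dumps(format_row(row, schema)))

Whether a missing value should become a JSON null or have its key dropped entirely is a design choice; the sketch just emits null.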
One word of caution: I am fairly sure my Scala code does not yet work in the case of missing values; the formatItem method needs to handle null elements.
Edit: Spark 1.2.0 introduced a toJSON method on SchemaRDD, which makes this a much easier problem -- see the answer by @JeGarden.
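With that method the whole exercise collapses to something like (assuming Spark 1.2.0 or later):

    # toJSON() yields an RDD of JSON strings, one per row, ready to write out.
    jsonRDD.toJSON().saveAsTextFile('jfile_out.json')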