python - pyspark: Save schemaRDD as json file -


I am looking for a way to export data from Apache Spark to various other tools in JSON format. I assume there must be a really straightforward way to do this.

Example: I have the following JSON file 'jfile.json':

  {"key": value_a1, "Key2": value_b1}, {"key": Value_a2, "key2": value_b2}, {...}  

where each line of the file is a JSON object.

These kinds of files can easily be read into PySpark with

  jsonRDD = sqlContext.jsonFile('jfile.json')

and the result then looks like (e.g. after calling jsonRDD.collect()):

  [Row(key=value_a1, key2=value_b1), Row(key=value_a2, key2=value_b2)]

Now I want to save these kinds of files back out as pure JSON.

I found an entry in the Spark user list that suggests simply writing the RDD out with saveAsTextFile(...).

After doing this, the text file looks like

  Row(key=value_a1, key2=value_b1)
  Row(key=value_a2, key2=value_b2)

, i.e., the jsonRDD has simply been written to the file as the string form of each Row. After reading the Spark user list entry I had expected some kind of "automagic" conversion back into JSON format.
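In other words, the whole attempt boils down to roughly the following (a minimal sketch assuming Spark 1.x with a SQLContext; the app name and output directory are just examples):

  from pyspark import SparkContext
  from pyspark.sql import SQLContext

  sc = SparkContext(appName="json-roundtrip")   # example app name
  sqlContext = SQLContext(sc)

  # Reading works fine: every line of jfile.json becomes a Row
  jsonRDD = sqlContext.jsonFile('jfile.json')

  # Saving does not do what I hoped: this writes str(Row(...)) for every
  # record instead of JSON
  jsonRDD.saveAsTextFile('jfile_out')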

Am I missing an easy way to do this?

I searched Google, the user list and Stack Overflow for an answer, but almost all of the answers deal with reading and parsing JSON into Spark. I even bought the book 'Learning Spark', but the examples there (page 71) just lead to the same output file as above.

Can anybody help me out here? I feel like I am just missing a small link.

Cheers and thanks in advance!

I can't see an easy way to do it. One solution is to convert every element of the SchemaRDD to a String, ending up with an RDD[String] where each element is the formatted JSON for that row, so you have to write your own JSON serializer. That's the easy part. It may not be super fast, but it should work in parallel, and you already know how to save an RDD to a text file.
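In PySpark terms, a rough sketch of that idea (not code I have actually run; it assumes your version of pyspark.sql.Row provides asDict() and that the rows are flat) would be:

  import json

  # Each Row becomes a plain dict and is serialized independently, so the
  # result is an RDD[String] with one JSON object per element / output line.
  jsonStrings = jsonRDD.map(lambda row: json.dumps(row.asDict()))
  jsonStrings.saveAsTextFile('jfile_json_out')   # example output directory

This only covers flat rows; nested structures need the schema-driven traversal described next.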

The key insight is that you can get a representation of the schema out of the SchemaRDD by calling its schema method. Then each Row handed to you by map needs to be traversed recursively, in tandem with that schema. For flat JSON this is just a tandem list traversal, but you may also need to handle nested JSON.

The rest is a small matter of Python, which I don't speak, but I do have this working in Scala in case that helps you. Where the Scala code gets dense it doesn't really depend on deep Spark knowledge, so if you can follow the basic recursion and you know Python, you should be able to work it out. The bulk of the work for you is figuring out how to work with a pyspark.sql.Row and a pyspark.sql.StructType in the Python API.
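A Python transliteration of the idea might look roughly like the following (my guess at how the Scala maps onto the Python API; formatRow and formatItem are just names I made up here, MapType is ignored, and missing values are not handled, as noted below):

  import json
  from pyspark.sql import StructType, ArrayType

  def formatItem(value, dataType):
      # Walk the value and its DataType in tandem, recursing into structs and arrays
      if isinstance(dataType, StructType):
          return {field.name: formatItem(value[i], field.dataType)
                  for i, field in enumerate(dataType.fields)}
      if isinstance(dataType, ArrayType):
          return [formatItem(element, dataType.elementType) for element in value]
      return value   # atomic values are left to json.dumps

  def formatRow(row, schema):
      return json.dumps(formatItem(row, schema))

  schema = jsonRDD.schema()   # schema is a method on SchemaRDD
  jsonStrings = jsonRDD.map(lambda row: formatRow(row, schema))
  jsonStrings.saveAsTextFile('jfile_json_out')

The Spark-specific part is confined to indexing into the Row and inspecting the StructType/ArrayType fields; the rest is plain recursion.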

One word of caution: I am pretty sure my code does not yet work for missing values; the formatItem method needs to handle null elements.
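In the sketch above that would amount to something like short-circuiting on missing values before recursing:

  def formatItem(value, dataType):
      if value is None:   # emit JSON null instead of recursing into nothing
          return None
      # ... then continue with the StructType / ArrayType / atomic cases as above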

Edit: Spark 1.2.0 added the toJSON method to SchemaRDD, which makes this a much easier problem. See the answer by @JeGarden.
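With that, the whole thing collapses to something like (sketch; the output directory name is again just an example):

  # Spark 1.2.0+: toJSON() returns an RDD of JSON strings, one object per line
  jsonRDD.toJSON().saveAsTextFile('jfile_json_out')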

