hadoop - Best way to filter large data sets
I have a question about how best to filter records in a large data set of financial transactions. We use an Oracle 11g database, and one of the requirements is to produce various end-of-day reports with all sorts of criteria. The tables involved are roughly:
- trade_matetight: 18M rows, 10 GB
- trade_echemonics: 18M rows, 15 GB
- business_functions: 18M rows, 11 GB
- trade_activity_link: 18M rows, 3 GB
One of our reports now takes ages (> 5 hours) to run. The underlying query has been tuned from time to time, but new criteria keep being added, so we end up fighting it all over again. The process itself is very standard: join all the tables and apply a host of WHERE clauses (20 at last count).
This made me wonder whether my problem is big enough to justify a big data solution, so I can get out of this optimize-the-query game that comes around every few months. In any case, the volumes are only going up. I've read a bit about Hadoop + HBase, Cassandra, and Apache Pig, but being very new to this space, I'm a little confused about the best way forward.
My gut feeling is that this is not really a map-reduce problem. HBase does offer filters, but I'm not sure about their performance. Could the enlightened people here answer a few questions for me:
- Is the data size big enough for big data solutions (do I need to join the billion-row club first)?
- If it is, would HBase be a good choice to implement it?
- We are not moving away from Oracle any time soon, even though the volumes keep growing. Would I be looking at populating HDFS with a daily dump of the relevant tables? Or would daily delta writes be possible?
Thanks a lot!
Welcome to the incredibly varied big data ecosystem. If your data set has grown large enough that it is straining your ability to analyze it with traditional tools, then it is big enough for big data technologies. As you have probably seen, a huge number of tools are available, many of them with overlapping capabilities.
First of all, you have not mentioned whether you have a cluster set up. If not, I would suggest looking at the products from Cloudera and Hortonworks. These companies provide Hadoop distributions that include the most popular big data tools (HBase, Spark, Sqoop, etc.) and make it easier to configure and manage the nodes that make up your cluster. Both companies provide their distributions free of charge, but you will have to pay for support.
Next, you will need to get your data out of Oracle and into the Hadoop cluster in some format. The tool most often used to move data from a relational database into the cluster is Sqoop. Sqoop can load your tables into HBase, Hive, or plain files on the Hadoop Distributed File System (HDFS), and it can also do incremental imports to pick up updates rather than reloading the entire table. Which destination you choose affects which tools you can use in the next step: HDFS is the most flexible, since you can reach it from Pig, MapReduce code you write yourself, Hive, Cloudera Impala, and others. I have found HBase very easy to use, but others recommend Hive.
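As a rough sketch of what that could look like (the host, credentials, table name, and key column below are placeholders, not anything from your schema):

# Hypothetical one-off import of an Oracle table into HDFS.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username reporting_user -P \
  --table TRADE \
  --target-dir /data/trade \
  --num-mappers 4

# Subsequent daily runs can fetch only the delta instead of the full
# table, using an incremental append on a monotonically increasing key.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username reporting_user -P \
  --table TRADE \
  --target-dir /data/trade \
  --incremental append \
  --check-column TRADE_ID \
  --last-value 18000000

The incremental mode is what would answer your dump-vs-delta question: after the first full load, each run only pulls rows past the recorded --last-value.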
As an aside: Apache has a project called Spark, which is expected to eventually replace Hadoop MapReduce. Spark claims up to a 100x speedup over traditional Hadoop MapReduce jobs. Many projects, including Hive, will run on Spark, giving you the ability to issue SQL-like queries over big data and get results very quickly.
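For a feel of what that looks like from the Hive side, assuming a Hive build with Spark support enabled and a placeholder table name:

# Hypothetical: run a HiveQL query on the Spark execution engine
# instead of classic MapReduce (needs Hive built with Spark support).
hive -e "SET hive.execution.engine=spark;
         SELECT COUNT(*) FROM trade WHERE trade_date = '2014-06-01';"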
Now that your data is loaded, you need to run those end-of-day reports. If you choose Hive, you can reuse a lot of your SQL knowledge instead of having to learn Java or Pig Latin (not that Pig Latin is very hard). Pig translates Pig Latin into MapReduce jobs (as does Hive's HiveQL, for now). Whichever tool you pick for this stage, I recommend looking into Oozie to automate the ingestion, the analytics, and the movement of the results back out of the cluster (see Sqoop export for that). Oozie lets you schedule and chain these steps so that you can focus on the results. Oozie's full capabilities are well documented.
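To make the Hive route concrete, one of your reports might look something like the sketch below; the table and column names are invented for illustration, and the Sqoop export at the end pushes the finished result back into Oracle:

# Hypothetical end-of-day report in HiveQL: join the big tables,
# apply the filter criteria, and write the result to an HDFS directory.
hive -e "
INSERT OVERWRITE DIRECTORY '/reports/eod'
SELECT t.trade_id, t.notional, e.venue
FROM   trade t
JOIN   trade_activity_link l ON t.trade_id = l.trade_id
JOIN   trade_economics e     ON l.econ_id  = e.econ_id
WHERE  t.trade_date = '2014-06-01'
  AND  t.status     = 'SETTLED'  -- plus the rest of your 20 criteria
;"

# Hypothetical Sqoop export: load the report back into an Oracle table
# so your existing reporting front-end can keep reading from Oracle.
# \001 is Hive's default field delimiter for directory output.
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username reporting_user -P \
  --table EOD_REPORT \
  --export-dir /reports/eod \
  --input-fields-terminated-by '\001'

An Oozie workflow would then chain these three steps (import, query, export) on a daily schedule.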
There is a crazy number of tools at your disposal, and the speed of change in this ecosystem can give you whiplash. Both Cloudera and Hortonworks provide virtual machines you can use to try out their distributions. I strongly suggest spending less time deep-diving into each tool and instead just trying out a few of them (such as Hive, Pig, and Oozie) to see which works best for your application.