hadoop - Best way to filter large data sets
I have a question about how best to filter records in a large data set of financial transactions. We use an Oracle 11g database, and one of the requirements is to produce various end-of-day reports with all sorts of criteria. The tables involved are roughly:
- trade_matetight: 18M rows, 10 GB
- trade_echemonics: 18M rows, 15 GB
- business_functions: 18M rows, 11 GB
- trade_activity_link: 18M rows, 3 GB
One of our reports now takes ages (> 5 hours) to run. The underlying query has been tuned from time to time, but new criteria keep being added, so we end up fighting it all over again. The process itself is very standard: join all the tables and apply a host of WHERE clauses (20 at last count).
This made me wonder whether my problem is big enough to justify a big data solution, so I can get out of this optimize-the-query game that comes around every few months. In any case, the volumes are only going up. I've read a bit about Hadoop + HBase, Cassandra, and Apache Pig, but being very new to this space, I'm a little confused about the best way forward.
My gut feeling is that this is not really a map-reduce problem. HBase does offer filters, but I'm not sure about their performance. Could the enlightened people here answer a few questions for me:
- Is the data size big enough for big data solutions (do I need to join the billion-row club first)?
- If it is, would HBase be a good choice to implement it?
- We are not moving away from Oracle any time soon, even though the volumes keep growing. Would I be looking at populating HDFS with a daily dump of the relevant tables? Or would daily delta writes be possible?
Thanks a lot!
Welcome to the incredibly varied big data ecosystem. If your data set has grown large enough that it is straining your ability to analyze it with traditional tools, then it is big enough for big data technologies. As you have probably seen, a huge number of tools are available, many of them with overlapping capabilities.
First of all, you have not mentioned whether you have a cluster set up. If not, I would suggest looking at the products from Cloudera and Hortonworks. These companies provide Hadoop distributions that include the most popular big data tools (HBase, Spark, Sqoop, etc.) and make it easier to configure and manage the nodes that make up your cluster. Both companies provide their distributions free of charge, but you will have to pay for support.
Next, you will need to get your data out of Oracle and into the Hadoop cluster in some format. The tool most often used to move data from a relational database into the cluster is Sqoop. Sqoop can load your tables into HBase, Hive, or plain files on the Hadoop Distributed File System (HDFS), and it can also do incremental imports to pick up updates rather than reloading the entire table. Which destination you choose affects which tools you can use in the next step: HDFS is the most flexible, since you can reach it from Pig, MapReduce code you write yourself, Hive, Cloudera Impala, and others. I have found HBase very easy to use, but others recommend Hive.
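As a rough sketch of what that could look like (the host, credentials, table name, and key column below are placeholders, not anything from your schema):

# Hypothetical one-off import of an Oracle table into HDFS.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username reporting_user -P \
  --table TRADE \
  --target-dir /data/trade \
  --num-mappers 4

# Subsequent daily runs can fetch only the delta instead of the full
# table, using an incremental append on a monotonically increasing key.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username reporting_user -P \
  --table TRADE \
  --target-dir /data/trade \
  --incremental append \
  --check-column TRADE_ID \
  --last-value 18000000

The incremental mode is what would answer your dump-vs-delta question: after the first full load, each run only pulls rows past the recorded --last-value.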
As an aside: Apache has a project called Spark, which is expected to eventually replace Hadoop MapReduce. Spark claims up to a 100x speedup over traditional Hadoop MapReduce jobs. Many projects, including Hive, will run on Spark, giving you the ability to issue SQL-like queries over big data and get results very quickly.
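For a feel of what that looks like from the Hive side, assuming a Hive build with Spark support enabled and a placeholder table name:

# Hypothetical: run a HiveQL query on the Spark execution engine
# instead of classic MapReduce (needs Hive built with Spark support).
hive -e "SET hive.execution.engine=spark;
         SELECT COUNT(*) FROM trade WHERE trade_date = '2014-06-01';"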
Now that your data is loaded, you need to run those end-of-day reports. If you choose Hive, you can reuse a lot of your SQL knowledge instead of having to learn Java or Pig Latin (not that Pig Latin is very hard). Pig translates Pig Latin into MapReduce jobs (as does Hive's HiveQL, for now). Whichever tool you pick for this stage, I recommend looking into Oozie to automate the ingestion, the analytics, and the movement of the results back out of the cluster (see Sqoop export for that). Oozie lets you schedule and chain these steps so that you can focus on the results. Oozie's full capabilities are well documented.
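To make the Hive route concrete, one of your reports might look something like the sketch below; the table and column names are invented for illustration, and the Sqoop export at the end pushes the finished result back into Oracle:

# Hypothetical end-of-day report in HiveQL: join the big tables,
# apply the filter criteria, and write the result to an HDFS directory.
hive -e "
INSERT OVERWRITE DIRECTORY '/reports/eod'
SELECT t.trade_id, t.notional, e.venue
FROM   trade t
JOIN   trade_activity_link l ON t.trade_id = l.trade_id
JOIN   trade_economics e     ON l.econ_id  = e.econ_id
WHERE  t.trade_date = '2014-06-01'
  AND  t.status     = 'SETTLED'  -- plus the rest of your 20 criteria
;"

# Hypothetical Sqoop export: load the report back into an Oracle table
# so your existing reporting front-end can keep reading from Oracle.
# \001 is Hive's default field delimiter for directory output.
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username reporting_user -P \
  --table EOD_REPORT \
  --export-dir /reports/eod \
  --input-fields-terminated-by '\001'

An Oozie workflow would then chain these three steps (import, query, export) on a daily schedule.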
There is a crazy number of tools at your disposal, and the speed of change in this ecosystem can give you whiplash. Both Cloudera and Hortonworks provide virtual machines you can use to try out their distributions. I strongly suggest spending less time deep-diving into each tool and instead just trying out a few of them (such as Hive, Pig, and Oozie) to see which works best for your application.