Scheduling tool (ETL+ task flow)
Kettle is an ETL tool; ETL is short for extract-transform-load.

The name Kettle literally means a water kettle: MATT, the project's lead programmer, wanted to pour all kinds of data into a kettle and then stream it out in a specified format.

So its focus is on the data itself.

Oozie is a workflow scheduler: a set of actions (such as Hadoop Map/Reduce jobs, Pig jobs, etc.) is placed in a control-dependency DAG (directed acyclic graph), which specifies the execution order of the actions.

An Oozie workflow may carry data flow, but its focus is on the definition of the workflow itself.

Although both involve related functionality and data flows, their purposes are actually different.

View help

List all databases (on Linux).

List all databases (on Windows).

View all tables under the database.
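
As an illustration of these commands (the hostname node03, the database userdb, and the credentials are placeholders, not taken from the original):

# Show Sqoop's built-in help
sqoop help

# List all databases on the MySQL server (same command on Linux and Windows)
sqoop list-databases \
  --connect jdbc:mysql://node03:3306/ \
  --username root --password 123456

# List all tables in a given database
sqoop list-tables \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456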

(1) Make sure the MySQL service has started normally.

Check the listening port or the process to determine whether MySQL is running.

Method 1: query port

MySQL listens on TCP port 3306; if port 3306 shows up in the output, the MySQL service is running.

Method 2: Check the process

You can see the mysqld process.
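
For example (a minimal sketch; the exact service and process names may differ by installation):

# Method 1: check whether MySQL is listening on TCP port 3306
netstat -nltp | grep 3306

# Method 2: check whether the mysqld process exists
ps -ef | grep mysqld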

If the import directory is not specified, the default directory is /user/root/<table name>.

Reason:

If the table has a primary key, the value of -m can be set greater than 1. If there is no primary key, -m can only be set to 1; or, to use an -m value greater than 1, you must specify a split column with --split-by.

Setting -m 1 means that only one map task performs the import. By default, four map tasks perform the import, but in that case you must specify a column as the basis for splitting.
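
As a hedged illustration (the table and split column are made up for the example), a parallel import over four map tasks might look like:

# Import with 4 map tasks, splitting the work on the id column
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table emp \
  --split-by id \
  -m 4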

Import data into the specified directory.

When importing table data into HDFS, the Sqoop import tool lets us specify a target directory. Use the --target-dir parameter to specify the target directory, and use --delete-target-dir to delete the target directory first if it already exists.
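
For instance (directory and table names are illustrative):

# Import the emp table into a specific HDFS directory,
# deleting the directory first if it already exists
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table emp \
  --delete-target-dir \
  --target-dir /sqoop/emp \
  -m 1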

Query import

Tip: $CONDITIONS must be included in the WHERE clause of the query.

where id <= 1 is the filter condition.

$CONDITIONS: a placeholder through which Sqoop passes the split condition to each map task.

If the query is wrapped in double quotes, you must add an escape character before $CONDITIONS (\$CONDITIONS) to prevent the shell from treating it as a shell variable.

--table cannot be used together with --query.

You need to specify the --target-dir path.
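
A sketch of a query import, assuming the same illustrative connection details:

# --query replaces --table; $CONDITIONS is mandatory in the WHERE clause,
# and it must be escaped as \$CONDITIONS inside double quotes
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --query "select * from emp where id <= 1 and \$CONDITIONS" \
  --target-dir /sqoop/emp_query \
  -m 1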

Import into a specified HDFS directory with specified conditions.

Data storage file format (text file or Parquet): --as-textfile imports data in plain text format (the default); --as-parquetfile imports data as Parquet files.

Import a subset of table data into HDFS.
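
A subset import is typically done with --where (the filter below is only an example), optionally combined with a storage format flag:

# Import only the rows matching the --where filter, stored as Parquet
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table emp \
  --where "id > 100" \
  --as-parquetfile \
  --delete-target-dir \
  --target-dir /sqoop/emp_subset \
  -m 1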

Sqoop imports BLOB data into Hive.

For CLOB data, such as XML text, Sqoop can migrate it into a Hive table directly, and the corresponding fields are stored as character types.

For BLOB data, such as JPG images, Sqoop cannot migrate it into a Hive table directly; it can only be migrated to an HDFS path first and then loaded into the Hive table with a Hive command. After migrating to HDFS, the BLOB field is stored as a hexadecimal string.

2.1.3 Import a relational table into Hive

Step 1: Copy the required jar package.

To import data from a MySQL table directly into a Hive table, we need to copy the jar package hive-exec-1.1.0-cdh5.14.0.jar into Sqoop's lib directory.
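
For example (the paths depend on your installation and are placeholders):

# Copy the hive-exec jar into Sqoop's lib directory
cp /path/to/hive/lib/hive-exec-1.1.0-cdh5.14.0.jar $SQOOP_HOME/lib/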

Step 2: Start the import

Import the relational table into Hive and automatically create the Hive table.

You can also import the MySQL table directly into a Hive table with a single command.

With this command, both the data and the table structure of the MySQL table are imported directly into Hive.
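
A hedged sketch of such a command (database, table, and delimiter are illustrative):

# Import the MySQL table into Hive, creating the Hive table automatically
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table emp \
  --hive-import \
  --create-hive-table \
  --hive-database default \
  --hive-table emp_hive \
  --fields-terminated-by '\t' \
  -m 1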

--incremental: incremental import mode.

append: incremental import based on the increasing values of a column (usually an auto-increment id).

lastmodified "2016-12-15 15:47:35": obtain all data modified after the given timestamp.

--append: append mode.

--merge-key id: merge mode.

--check-column: specifies the column(s) to check; multiple columns can be given, but usually the primary key id is specified.

--last-value: the value from which the incremental import starts.

Note: during incremental import, the --delete-target-dir parameter cannot be added, otherwise an error will occur.

The first incremental import method (not commonly used)

1. Append mode

Usage scenario: there is an order table in which each order has a unique auto-increment column id, which is the primary key in the relational database. Orders with id between 0 and 1000 have already been imported into HDFS. When new orders are generated, we only need to set the --incremental parameter to append and the --last-value parameter to 1000, which means only rows with id greater than 1000 will be imported.

(1) Create a MySQL table.

(2) Create a Hive table (with the same structure as the MySQL table).

note:

Append mode does not support importing directly into a Hive table.
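
A minimal sketch of an append-mode incremental import for the order scenario above (table, directory, and connection details are illustrative):

# Import only rows whose id is greater than 1000, appending to the existing HDFS data
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table order_table \
  --target-dir /sqoop/order_table \
  --incremental append \
  --check-column id \
  --last-value 1000 \
  -m 1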

2. lastmodified mode

This mode requires the source table to have a time field; given a timestamp, Sqoop can import the data modified after that timestamp into HDFS. Because an order may change later, the value of its time field also changes, and Sqoop will then import the changed orders into HDFS again. We can additionally set the --merge-key parameter to the order id, which means that a new record will be merged with the original record instead of being duplicated.

# Incrementally import data whose timestamp is greater than or equal to the threshold into HDFS.

When importing data with the lastmodified mode, you must specify whether incremental data should be appended (--append) or merged (--merge-key). The value specified by --last-value is included in the incrementally imported data.
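
A hedged example of a lastmodified import with merging on the key column (the update_time column and other names are assumptions):

# Import rows whose update_time is >= the last value, merging changed rows by id
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table order_table \
  --target-dir /sqoop/order_table \
  --incremental lastmodified \
  --check-column update_time \
  --last-value "2016-12-15 15:47:35" \
  --merge-key id \
  -m 1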

The second incremental import method (recommended)

Selecting data with a WHERE condition is more precise.
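
A sketch of this approach, assuming a create_time column and importing one day's data directly into Hive:

# Select the increment explicitly with a WHERE condition on the date
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table order_table \
  --where "create_time >= '2016-12-15' and create_time < '2016-12-16'" \
  --hive-import \
  --hive-table order_table \
  -m 1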

2.1.5 From RDBMS to HBase

An error will be reported

Reason: Sqoop 1.4.6 only supports automatically creating HBase tables for HBase versions before 1.0.1.

Solution: create the HBase table manually.
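
For example (table name, column family, and row key are illustrative):

# Create the HBase table manually first (in the HBase shell):
#   create 'hbase_emp', 'info'
# Then import from MySQL into the existing HBase table
sqoop import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table emp \
  --hbase-table hbase_emp \
  --column-family info \
  --hbase-row-key id \
  -m 1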

The target table must exist in the target database before exporting.

The default operation is to insert the data from the files into the table using INSERT statements.

The data is in the HDFS directory /sqoop/emp, with the following content.

Step 1: Create MySQL table.

Step 2: Execute the export command.

Export data with the sqoop export command, exporting HDFS data into MySQL.
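
A hedged full-export example (the field delimiter depends on how the HDFS files were written; the target table name is assumed):

# Export the files under /sqoop/emp into the existing MySQL table emp_export
sqoop export \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table emp_export \
  --export-dir /sqoop/emp \
  --input-fields-terminated-by ','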

Full export

Incremental export

Update export

Summary:

Parameter introduction

--update-key can be followed by multiple key column names, separated by commas. Sqoop matches all of these keys before performing the update.

The --export-dir parameter is used together with --table or --call to specify the HDFS directory containing the files whose data should be exported to MySQL.

--update-mode has two modes: updateonly and allowinsert. The default is updateonly: existing rows are only updated, and no new rows are inserted. If --update-mode is set to allowinsert, rows that do not yet exist in the target table are also inserted.
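
An update-export sketch, under the same illustrative names:

# Update existing rows by id, and insert rows that do not exist yet
sqoop export \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table emp_export \
  --export-dir /sqoop/emp \
  --input-fields-terminated-by ',' \
  --update-key id \
  --update-mode allowinsert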

Combination test and description

1. When --update-key is specified and the relational database table has a primary key:

A. In allowinsert mode, existing rows in the target table are updated, and rows that did not exist before are also inserted;

B. In updateonly mode, existing rows in the target table are updated, but rows that did not exist before are not inserted;

2. When --update-key is specified and the relational database table has no primary key:

A. In allowinsert mode, all data is appended into the table;

B. In updateonly mode, existing rows in the target table are updated, but rows that did not exist before are not inserted;

3. When --update-key is not specified and the relational database table has a primary key:

A. In allowinsert mode, the primary keys conflict and the data remains unchanged;

B. In updateonly mode, the primary keys conflict and the data remains unchanged;

4. When --update-key is not specified and the relational database table has no primary key:

A. In allowinsert mode, all data is appended into the table;

B. In updateonly mode, all data is appended into the table;

Actual case:

(1) Bulk import from MySQL into Hive

Use shell scripts:

The author currently uses Sqoop (version 1.4.7) to import MySQL data into Hive, finally achieving a command-line import as follows.
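
Since the original script is not reproduced here, the following is only a rough sketch of such a bulk-import script; the table list, connection details, and Hive database name are all assumptions:

#!/bin/bash
# Bulk-import a list of MySQL tables into Hive with Sqoop 1.4.7
tables="emp dept salgrade"

for t in $tables; do
  sqoop import \
    --connect jdbc:mysql://node03:3306/userdb \
    --username root --password 123456 \
    --table $t \
    --hive-import \
    --hive-database ods \
    --hive-table $t \
    --hive-overwrite \
    --fields-terminated-by '\t' \
    -m 1
done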

Finally, you need to turn this import into a job and run it every day, so that data is incrementally imported automatically. Sqoop supports job management: an import can be created as a job and run repeatedly; the incremental state is recorded in the metastore and queried before each incremental import.

The command to create the job is as follows.
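
Since the original command is not shown, here is an illustrative version (the job name and import arguments are assumptions):

# Create a job named order_incr_job; note the space between "--" and "import"
sqoop job --create order_incr_job -- import \
  --connect jdbc:mysql://node03:3306/userdb \
  --username root --password 123456 \
  --table order_table \
  --target-dir /sqoop/order_table \
  --incremental append \
  --check-column id \
  --last-value 0 \
  -m 1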

After you create a job, you can execute it.

Executing the Sqoop job.
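
For example:

# List existing jobs, then execute the one created above
sqoop job --list
sqoop job --exec order_incr_job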

You can set this command up as a Linux cron job, or schedule it periodically with Azkaban.

Issue: the date values change when Hive data is exported to MySQL.

Cause: a time zone configuration problem. Use date -R to check the server time, and use show variables like '%time_zone%' to check the MySQL time zone. If the system is not set to China Standard Time, set the time zone to UTC+8 (the East Eight zone).
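
A minimal sketch of the checks and the fix (the MySQL credentials are placeholders):

# Check the server time and time zone
date -R

# Check the MySQL time zone settings
mysql -uroot -p123456 -e "show variables like '%time_zone%';"

# Set the MySQL time zone to UTC+8 (East Eight zone)
mysql -uroot -p123456 -e "set global time_zone = '+8:00';"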

(1): For the two most popular schedulers on the market, a detailed comparison is given below for reference during technology selection. Overall, compared with Azkaban, Oozie is a heavyweight task scheduling system with comprehensive functionality, but it is more complicated to configure and use. Azkaban is a lightweight scheduler and is a good candidate if you can live without some of those features.

(2): Function:

Both can schedule MapReduce, Pig, Java, and script workflow tasks;

Both of them can execute workflow tasks regularly;

(3): workflow definition:

Azkaban uses property files to define workflows;

Oozie uses XML file to define workflow;

(4): Workflow parameter passing:

Azkaban supports direct parameter passing, such as ${input};

Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)};

(5): Scheduled execution:

Azkaban's scheduled execution of tasks is based on time;

Oozie's scheduled execution tasks are based on time and input data;

(6): Permission management:

Azkaban has strict permission control, such as per-user read/write/execute permissions on workflows;

Oozie has no strict permission control for the time being;

(7): Workflow execution:

Azkaban has two operation modes: solo server mode (the executor server and web server are deployed on the same node) and multi-server mode (the executor server and web server can be deployed on different nodes).

Oozie runs as a workflow server and supports multi-user and multi-workflow;

(8): Workflow management:

Azkaban supports operating workflows through a browser and Ajax;

Oozie supports command line, HTTP REST, Java API and browser operation workflow;

Browser page access

http://node03:8081/

When using Oozie, it is usually integrated with Hue for data warehouse scheduling.

This is the script you just selected.

Parameters needed in the script should be obtained dynamically and automatically wherever possible, for example ${date}.
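
For instance, inside the shell script the date can be computed instead of hard-coded (a minimal sketch):

# Use yesterday's date by default, or take it from the first argument if provided
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d '-1 day' +%F)
fi
echo "importing data for ${do_date}"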

In the first step the parameters are all the files plus today's date; the later steps only need the date. The final step exports all the results and fills them in accordingly.

Add files and set corresponding parameters.

After running, there will be a status prompt page, where you can see the progress of the task.

Click the task scheduling page.

Modify the name and description of the scheduled task.

Add tasks that need to be scheduled regularly.

The parameters of sm-workflow are all hard-coded. If they are not made dynamic, there will be no options in the drop-down list here.

Setting parameters

Change the date in sm-workflow to ${do_date} and save it.

Open the schedule sm-dw, and you will see the parameter do_date.

Fill in the corresponding parameter: the date of the previous day.

Common Oozie system constants

Of course, you can also pass values into the parameters needed by workflow tasks, code, or shell scripts.

For example, modify sqoop_import.sh in sm-workflow and add a parameter ${num}.

Edit the file (the user logged in to Hue must have permission to operate on HDFS), change a value in the shell script into the parameter, and save.

In the workflow, edit and add the parameter ${num} (or num=${num}) and save.

Open the schedule; you can see the added parameter, then edit it and enter the corresponding value.

A Bundle manages all schedules in a unified way; the hierarchy is Bundle > Schedule > Workflow.