9.sqoop-export

9.1.Purpose

The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.

The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database, and in "call mode" Sqoop will make a stored procedure call for each record.

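For a rough sense of how the three modes map onto the command line, here is a minimal sketch (the connection string, table name, stored procedure name, and HDFS path are placeholders, not values from this guide):

$ sqoop export --connect jdbc:mysql://db.example.com/db --table mytable \
    --export-dir /user/data/mytable                          # default: INSERT per record
$ sqoop export --connect jdbc:mysql://db.example.com/db --table mytable \
    --export-dir /user/data/mytable --update-key id          # update mode: UPDATE per record
$ sqoop export --connect jdbc:mysql://db.example.com/db --call myproc \
    --export-dir /user/data/mytable                          # call mode: stored procedure per record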

9.2.Syntax

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

Although the Hadoop generic arguments must precede any export arguments, the export arguments can be entered in any order with respect to one another.

Table18.Common arguments (described in an earlier section)

Argument Description
--connect <jdbc-uri> Specify JDBC connect string
--connection-manager <class-name> Specify connection manager class to use
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME
--help Print usage instructions
-P Read password from console
--password <password> Set authentication password
--username <username> Set authentication username
--verbose Print more information while working
--connection-param-file <filename> Optional properties file that provides connection parameters

Table19.Validation arguments

Argument Description
--validate Enable validation of data copied, supports single table copy only.
--validator <class-name> Specify validator class to use.
--validation-threshold <class-name> Specify validation threshold class to use.
--validation-failurehandler <class-name> Specify validation failure handler class to use.

Table20.Export control arguments:

Argument Description
--direct Use direct export fast path
--export-dir <dir> HDFS source path for the export
-m,--num-mappers <n> Use n map tasks to export in parallel
--table <table-name> Table to populate
--call <stored-proc-name> Stored Procedure to call
--update-key <col-name> Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
--update-mode <mode> Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string> The string to be interpreted as null for string columns
--input-null-non-string <null-string> The string to be interpreted as null for non-string columns
--staging-table <staging-table-name> The table in which data will be staged before being inserted into the destination table.
--clear-staging-table Indicates that any data present in the staging table can be deleted.
--batch Use batch mode for underlying statement execution.
staging table: a table used to stage the export in a separate step; in practice it can be thought of as a temporary table.

update-mode

By default, the Sqoop export tool turns the whole data transfer into a series of INSERT statements against the database. When the target table is empty, this default behaviour is fine. But if the target table is non-empty and has constraints, the INSERTs can fail because of primary-key conflicts and similar problems, and at present a single failed INSERT fails the whole Sqoop export job. To avoid this, you can use the --update-mode argument, which defines how the table is updated. It has two modes:

  • updateonly

  • allowinsert

The first mode, updateonly, is the default; it turns the whole export into a series of UPDATE statements. The column(s) the updates are matched on are given with --update-key <col-name>. In this mode, an UPDATE that does not touch any existing row (for example, because the row does not yet exist in the table) does not cause a failure, but it does not change any data either. If you want to update existing rows and at the same time insert rows that do not exist yet, use allowinsert mode: rows that are not found become INSERTs instead of UPDATEs. Choose the update mode that matches your situation.

staging-table

As noted above, when an export runs, the failure of a single INSERT statement can fail the whole export job, possibly after part of the data has already been inserted into the table, which leaves the data incomplete. Ideally either all of the data is committed to the database, or none of it is. To get this behaviour, you can create a temporary table in the database beforehand to hold the intermediate data, called a staging table. The staging table has exactly the same structure as the target table; only after every statement has completed successfully is the data moved from the staging table into the target table, so a failed job leaves the target table untouched. When using the --staging-table argument you can also pass --clear-staging-table, which makes sure the staging table is emptied before the export job starts.

The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call), and the directory in HDFS that contains the source data.

You can control the number of mappers independently from the number of files present in the directory. Export performance depends on the degree of parallelism. By default, Sqoop will use four tasks in parallel for the export process. This may not be optimal; you will need to experiment with your own particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then additional load may decrease performance. The --num-mappers or -m arguments control the number of map tasks, which is the degree of parallelism used.

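For example, to try eight map tasks instead of the default four (the connection string, table, and path are placeholders):

$ sqoop export --connect jdbc:mysql://db.example.com/db --table mytable \
    --export-dir /user/data/mytable -m 8

Whether eight is actually faster than four depends on your database; if the database is already the bottleneck, a lower value may perform better.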

MySQL provides a direct mode for exports as well, using the mysqlimport tool. When exporting to MySQL, use the --direct argument to specify this codepath. This may be higher-performance than the standard JDBC codepath.

[Note]

When using export in direct mode with MySQL, the MySQL bulk utility mysqlimport must be available in the shell path of the task process.

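A sketch of a MySQL direct-mode export (placeholder names; mysqlimport has to be on the PATH of every node that runs an export map task):

$ sqoop export --connect jdbc:mysql://db.example.com/db --table mytable \
    --export-dir /user/data/mytable --direct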

The --input-null-string and --input-null-non-string arguments are optional. If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns. Note that the empty string will always be interpreted as null for non-string columns, in addition to any other string specified by --input-null-non-string.

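For example, Hive-generated text files usually encode NULL as the string \N. A hedged sketch of mapping that back to SQL NULL on export (placeholder names; the double backslash is the usual way to get the literal \N through Sqoop's argument parsing):

$ sqoop export --connect jdbc:mysql://db.example.com/db --table mytable \
    --export-dir /user/data/mytable \
    --input-null-string '\\N' --input-null-non-string '\\N'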

Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.

(See the staging-table note under 9.2 above.)

In order to use the staging facility, you must create the staging table prior to running the export job. This table must be structurally identical to the target table. This table should either be empty before the export job runs, or the --clear-staging-table option must be specified. If the staging table contains data and the --clear-staging-table option is specified, Sqoop will delete all of the data before starting the export job.

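Putting the staging options together, using the bar table from the example section below and a hypothetical staging table bar_stage that you would create yourself with the same structure as bar:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --staging-table bar_stage --clear-staging-table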

[Note]

Support for staging data prior to pushing it into the destination table is not available for --direct exports. It is also not available when export is invoked using the --update-key option for updating existing data, and when stored procedures are used to insert the data.

9.3.Inserts vs. Updates

By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.

If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key. For example, consider the following table definition:

CREATE TABLE foo(
    id INT NOT NULL PRIMARY KEY,
    msg VARCHAR(32),
    bar INT);

Consider also a dataset in HDFS containing records like these:

0,this is a test,42
1,some more data,100
...

Running sqoop-export --table foo --update-key id --export-dir /path/to/data --connect … will run an export job that executes SQL statements based on the data like so:

UPDATE foo SET msg='this is a test', bar=42 WHERE id=0;
UPDATE foo SET msg='some more data', bar=100 WHERE id=1;
...

If an UPDATE statement modifies no rows, this is not considered an error; the export will silently continue. (In effect, this means that an update-based export will not insert new rows into the database.) Likewise, if the column specified with --update-key does not uniquely identify rows and multiple rows are updated by a single statement, this condition is also undetected.

The argument --update-key can also be given a comma separated list of column names. In which case, Sqoop will match all keys from this list before updating any existing record.

If --update-key is given a comma-separated list of columns (for example c1,c2,c3), Sqoop matches on all of them together: the generated UPDATE statements use WHERE c1=… AND c2=… AND c3=…, and only rows matching every key column are updated.

Depending on the target database, you may also specify the --update-mode argument with allowinsert mode if you want to update rows if they exist in the database already or insert rows if they do not exist yet.

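Continuing the foo example above, an upsert-style export (update rows whose id already exists, insert the rest) would look roughly like this, assuming the target database supports allowinsert mode:

$ sqoop export --table foo --update-key id --update-mode allowinsert \
    --export-dir /path/to/data --connect …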

Table21.Input parsing arguments:

Refer to the earlier file format section.

Argument Description
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line character
--input-optionally-enclosed-by <char> Sets a field enclosing character

Table22.Output line formatting arguments (explained in an earlier section):

Argument Description
--enclosed-by <char> Sets a required field enclosing character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by:'
--optionally-enclosed-by <char> Sets a field enclosing character

Sqoop automatically generates code to parse and interpret records of the files containing the data to be exported back to the database. If these files were created with non-default delimiters (comma-separated fields with newline-separated records), you should specify the same delimiters again so that Sqoop can parse your files.

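For example, if the source files were written with tab-separated fields instead of the default commas, the export needs to be told so (the connection string, table, and path are placeholders):

$ sqoop export --connect jdbc:mysql://db.example.com/db --table mytable \
    --export-dir /user/data/mytable \
    --input-fields-terminated-by '\t' --input-lines-terminated-by '\n'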

If you specify incorrect delimiters, Sqoop will fail to find enough columns per line. This will cause export map tasks to fail by throwing ParseExceptions.

Table23.Code generation arguments:

Argument Description
--bindir <dir> Output directory for compiled objects
--class-name <name> Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file> Disable code generation; use specified jar
--outdir <dir> Output directory for generated code
--package-name <name> Put auto-generated classes in this package
--map-column-java <m> Override default mapping from SQL type to Java type for configured columns.

If the records to be exported were generated as the result of a previous import, then the original generated class can be used to read the data back. Specifying --jar-file and --class-name obviates the need to specify delimiters in this case.

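A sketch of reusing previously generated code (the jar path and class name are hypothetical; use whatever the earlier import actually produced):

$ sqoop export --connect jdbc:mysql://db.example.com/db --table mytable \
    --export-dir /user/data/mytable \
    --jar-file mytable.jar --class-name mytable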

The use of existing generated code is incompatible with --update-key; an update-mode export requires new code generation to perform the update. You cannot use --jar-file, and must fully specify any non-default delimiters.

9.4.Exports and Transactions

Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database; these have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound, and cause out-of-memory conditions. Therefore, an export is not an atomic process. Partial results from the export will become visible before the export is complete.

(Here, "partial results" means that part of the data being exported from HDFS has already become visible in the database.)
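The per-statement and per-transaction sizes can reportedly be tuned with Hadoop -D properties placed before the export arguments; the property names below (sqoop.export.records.per.statement and sqoop.export.statements.per.transaction) are an assumption not stated in this guide, so check them against your Sqoop version:

$ sqoop export -D sqoop.export.records.per.statement=100 \
    -D sqoop.export.statements.per.transaction=100 \
    --connect jdbc:mysql://db.example.com/db --table mytable \
    --export-dir /user/data/mytable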

Exports may fail for a number of reasons:

  • Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)

  • Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)

  • Attempting to parse an incomplete or malformed record from the HDFS source data

  • Attempting to parse records using incorrect delimiters

  • Capacity issues (such as insufficient RAM or disk space)

If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back. Any previously-committed transactions will remain durable in the database, leading to a partially-complete export.

9.5.Example Invocations

A basic export to populate a table named bar:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data

This example takes the files in /results/bar_data and injects their contents in to the bar table in the foo database on db.example.com. The target table must already exist in the database. Sqoop performs a set of INSERT INTO operations, without regard for existing content. If Sqoop attempts to insert rows which violate constraints in the database (for example, a particular primary key value already exists), then the export fails.

Another basic export to populate a table named bar with validation enabled:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --validate

An export that calls a stored procedure named barproc for every record in /results/bar_data would look like:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc \
--export-dir /results/bar_data