Friday, August 28, 2009

Split utility of Unix

Unix provides a 'split' utility that breaks a large file into smaller chunks of equal size (the last chunk may be smaller). The man page for split lists further options for other needs; the simple example below splits a text file of 1000 lines into 10 equal chunks of 100 lines each.

The split files are then merged with a 'cat' command in a 'for' loop. Take extra care when splitting binary/dmp files: the split itself succeeds, but the merged result can be misleading, and the size of the file before and after the split may not match.
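One way to guard against such a mismatch is to split by byte count rather than line count and then compare the merged file against the original. The snippet below is only a sketch: the file names (sample.bin, sample.part_, sample.rejoined), the 4096-byte chunk size and the scratch directory are assumptions made for illustration and are not part of the session shown in this post.

```shell
# Work in a scratch directory so no real file is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a 10000-byte binary sample (hypothetical file name).
dd if=/dev/urandom of=sample.bin bs=1000 count=10 2>/dev/null

# Split by size (-b) instead of by line count (-l).
split -b 4096 -a 2 sample.bin sample.part_
part_count=$(ls sample.part_* | wc -l)

# The shell expands the pattern in sorted order (aa, ab, ac ...),
# so the pieces are concatenated in the order they were written.
cat sample.part_* > sample.rejoined

# cmp -s exits with 0 only when both files are byte-for-byte identical.
if cmp -s sample.bin sample.rejoined
then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "parts: $part_count, round trip: $roundtrip"
```

If the comparison reports a mismatch, the merge order or an interrupted split is the first thing to check.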

First off, we create a file named 'un_split_file.out' containing 1000 lines; part of it is shown below.




UNIX:/prd/u01/acme> export i=0
UNIX:/prd/u01/acme> echo $i
0
UNIX:/prd/u01/acme> while [ "$i" -ne 1000 ]
> do
> echo "This is line $i" >> un_split_file.out
> i=`expr $i + 1`
> done

UNIX:/prd/u01/acme> wc -l un_split_file.out
1000 un_split_file.out

UNIX:/prd/u01/acme > head -10 un_split_file.out
This is line 0
This is line 1
This is line 2
This is line 3
This is line 4
This is line 5
This is line 6
This is line 7
This is line 8
This is line 9

UNIX:/prd/u01/acme > tail -10 un_split_file.out
This is line 990
This is line 991
This is line 992
This is line 993
This is line 994
This is line 995
This is line 996
This is line 997
This is line 998
This is line 999

UNIX:/prd/u01/acme > ls -ltr un_split_file.out
-rw-r--r-- 1 oracle dba 16890 Aug 25 05:19 un_split_file.out
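The same sample file can be rebuilt and checked in one go. This is a minimal sketch, assuming a POSIX shell with arithmetic expansion and a throwaway directory from mktemp (neither is part of the session above). It redirects the whole loop once instead of appending on every pass, then confirms the 1000 lines and the 16890 bytes reported by 'ls'.

```shell
# Work in a scratch directory so an existing file is never appended to.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write lines 0..999; the loop's output is redirected once, at the end.
i=0
while [ "$i" -lt 1000 ]
do
    echo "This is line $i"
    i=$((i + 1))
done > un_split_file.out

# Reading from stdin makes wc print only the number, without the file name.
line_count=$(wc -l < un_split_file.out)
byte_count=$(wc -c < un_split_file.out)
echo "lines: $line_count bytes: $byte_count"
```

The byte count follows from the line lengths: 10 lines of 15 bytes, 90 lines of 16 bytes and 900 lines of 17 bytes add up to 16890.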



Now comes the 'split' command itself, which is passed 4 arguments here:



-l 100 -> Line count: a new file is started after every 100 lines from the
beginning of the input file

-a 2 -> Suffix length: as many split files as the line count requires are
created, each named with a 2-character suffix. By default the suffixes run
aa, ab, ac and so on.

The third argument is the name of the file to be split

The fourth argument is the prefix used to name the split files
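The suffix length also caps how many split files can be named: one letter allows 26 names, two letters allow 26 x 26 = 676. The sketch below illustrates this with chunks of 10 lines, which need 100 names. The prefixes tiny.part_ and ok.part_ and the scratch directory are made up for the illustration, and the failure with '-a 1' is the behaviour expected from POSIX-style split implementations rather than something shown in this post.

```shell
# Work in a scratch directory with a fresh 1000-line sample.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
i=0
while [ "$i" -lt 1000 ]
do
    echo "This is line $i"
    i=$((i + 1))
done > un_split_file.out

# 1000 lines / 10 lines per chunk = 100 chunks, but -a 1 offers only a..z.
if split -l 10 -a 1 un_split_file.out tiny.part_ 2>/dev/null
then
    short_suffix=succeeded
else
    short_suffix=failed
fi

# With -a 2 there are 676 possible names, plenty for 100 chunks.
split -l 10 -a 2 un_split_file.out ok.part_
ok_count=$(ls ok.part_* | wc -l)
echo "-a 1: $short_suffix, -a 2 produced $ok_count chunks"
```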


UNIX:/prd/u01/acme> split -l 100 -a 2 un_split_file.out split_file.part_

UNIX:/prd/u01/acme> ls -ltr
total 544
-rw-r--r-- 1 oracle dba 16890 Aug 25 05:19 un_split_file.out
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_aj
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ai
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ah
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ag
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_af
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ae
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ad
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ac
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ab
-rw-r--r-- 1 oracle dba 1590 Aug 25 05:22 split_file.part_aa
UNIX:/prd/u01/acme>
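The listing shows sizes only, so a quick per-chunk line count can confirm the split. The following is a sketch that rebuilds the sample in a scratch directory (an assumption, so that it can run on its own) and counts the lines in every chunk.

```shell
# Work in a scratch directory with a fresh 1000-line sample.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
i=0
while [ "$i" -lt 1000 ]
do
    echo "This is line $i"
    i=$((i + 1))
done > un_split_file.out

# Same split command as in the session above.
split -l 100 -a 2 un_split_file.out split_file.part_

# Count the chunks and print the number of lines in each one.
chunk_count=0
for part in split_file.part_*
do
    lines=$(wc -l < "$part")
    echo "$part: $lines lines"
    chunk_count=$((chunk_count + 1))
done
echo "chunks: $chunk_count"
```

Every chunk holds 100 lines; split_file.part_aa is the only smaller file (1590 bytes instead of 1700) because it contains the shorter one-digit and two-digit line numbers.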





The old file 'un_split_file.out' is renamed to 'un_split_file.out.deleted' so that
it does not conflict with the new file created by merging the split files
with the 'cat' command.





UNIX:/prd/u01/acme> mv un_split_file.out un_split_file.out.deleted

UNIX:/prd/u01/acme> for i in split_file.part_*
> do
> cat "$i" >> un_split_file.out
> done




The result is a new file named 'un_split_file.out' with exactly the same
contents as it had before being split. Neither the split nor the merge
removes the source or original files, as seen below.




UNIX:/prd/u01/acme> ls -ltr un_split*
-rw-r--r-- 1 oracle dba 16890 Aug 25 05:19 un_split_file.out.deleted
-rw-r--r-- 1 oracle dba 16890 Aug 25 05:25 un_split_file.out
UNIX:/prd/u01/acme> wc -l un_split_file.out
1000 un_split_file.out
UNIX:/prd/u01/acme>

UNIX:/prd/u01/acme > ls -ltr
total 578
-rw-r--r-- 1 oracle dba 16890 Aug 25 05:19 un_split_file.out.deleted
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_aj
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ai
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ah
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ag
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_af
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ae
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ad
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ac
-rw-r--r-- 1 oracle dba 1700 Aug 25 05:22 split_file.part_ab
-rw-r--r-- 1 oracle dba 1590 Aug 25 05:22 split_file.part_aa
-rw-r--r-- 1 oracle dba 16890 Aug 25 05:25 un_split_file.out
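Matching sizes and line counts are a good sign, but a byte-for-byte comparison is a stricter check. The sketch below repeats the whole split/merge cycle in a scratch directory (an assumption, so that it can run on its own), merges with a single 'cat' instead of a loop, and compares the result against the renamed original with 'cmp'.

```shell
# Work in a scratch directory with a fresh 1000-line sample.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
i=0
while [ "$i" -lt 1000 ]
do
    echo "This is line $i"
    i=$((i + 1))
done > un_split_file.out

# Split, then move the original out of the way as in the session above.
split -l 100 -a 2 un_split_file.out split_file.part_
mv un_split_file.out un_split_file.out.deleted

# The pattern expands in sorted order (aa, ab ... aj), so one cat is enough.
cat split_file.part_* > un_split_file.out

# cmp -s is silent and exits with 0 only when the files are identical.
if cmp -s un_split_file.out un_split_file.out.deleted
then
    merge=identical
else
    merge=different
fi
echo "merged file is $merge to the original"
```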

