Huge Text File, Need to Extract Specific Lines? Here’s How

If, like me, you have to manipulate huge text files containing thousands of lines of data, and you hate trawling through them manually, slowly, line by line, then boy do I have a few tricks to teach you.

These tricks for the mass manipulation of text files will turn your week-long, heart-wrenching job into a 5-minute, heart-lifting pleasure bordering on the orgasmic [just trying to liven things up].

I used to export and paste my text files into an OpenOffice spreadsheet, then sort them before cutting and pasting. Doing so used to cut in half the time taken to edit my text files. I even pseudo-shuffled lines of data by typing letters of the alphabet beside each data line, then sorting them into groups of like letters (e.g. all the a’s together, all the b’s together…). They were the Dark Years (OK, Weeks ;-), but I’m allowed to tell my tale my way; and if I want to make people feel sorry for us lonely, depressed webmasters, then I will. I’m glad those days are over.

I like to get things done as quickly as possible. I don’t like to dawdle. It’s a trait that has put me out of really good jobs and embarrassed others at the same time: I’ve built spreadsheets and devised processes that convert week-long jobs into two-hour Monday-morning tasks; who needs an extra employee once he’s killed his own job position :-? But that didn’t bother me. Progress and evolution, onwards and upwards; those are my mottoes.

So, let’s get down to business. As you know, I use Linux. Linux has some fantastic built-in commands that facilitate the manipulation of large text files: from bulk data extraction and data replacement to mass data sorting, shuffling and deletion. Here are a few commands and usage examples to help you earn some free time to look busy while playing game emulators (just don’t tell your boss).

To use these commands, either

  1. specify the full directory path to your file when you type in your file’s name, e.g. /home/user/Documents/source.txt;
  2. use your terminal of choice to navigate to the folder that contains your file, then specify only your file’s name (no need for the full path); or, graphically,
  3. use your file manager (Dolphin or Konqueror) to navigate to the folder that contains the file from which you wish to extract or copy data, then open a terminal (e.g. Konsole) from that position (right-click the containing folder and left-click “Open Terminal Here”; or, in Dolphin, when in the containing folder, click “Tools” in the top bar then select “Open Terminal”). This is the easiest method: used this way, the terminal commands need only the file’s name, not full directory paths.

One note of caution: if your source data file has been created or edited within a Windows environment then you must open the file in a Linux text editor and re-save it before you use any of these Linux terminal commands to manipulate the data. The reason is this: Linux and Windows use different characters to signal the termination of a line — Windows ends each line with a carriage return plus a line feed (CR+LF) while Linux uses a line feed alone — so, left unconverted, every line of a Windows file carries a stray carriage-return character that can quietly break pattern matching. The open-and-save action (or the dos2unix utility) fixes the line endings.

To the best of my knowledge, all these commands can be used on any plain-text file type, be it txt, sql, shtml, xml, php… (binary formats such as doc are a different story).

Be aware that Linux is case sensitive, i.e. “Apple” is not the same as “apple”, “aPPLe” or “APPLE”. This applies to all text, whether in a file’s name or in the data held in a file.

Split Files into Chunks
split -l [number of lines] [filename]

or, if you haven’t navigated to the folder that contains the file, you would use

split -l [number of lines] [Directory Path/filename]

(replace “Directory Path” with the path to the file)

Example:

split -l 60000 access.log

Would split the data lines in access.log into separate files, each containing 60,000 lines of data (the final chunk holds whatever is left over). It would start at the first line in the file and end at the last line in the file.

New files are created automatically, with names like xaa, xab, xac… rather than anything based on access.log. To get recognisable names, add a prefix as a final argument: split -l 60000 access.log access.log. would produce access.log.aa, access.log.ab and so on.
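The default chunk names take some getting used to, so here is a minimal sketch with a made-up seven-line file (all file names here are illustrative) so you can see the naming for yourself:

```shell
# A tiny stand-in for a huge log file: 7 lines instead of thousands.
printf '%s\n' line1 line2 line3 line4 line5 line6 line7 > access.log

# Split into chunks of 3 lines; the default output names are xaa, xab, xac.
split -l 3 access.log

# An optional prefix argument gives more recognisable names: access.log.aa, ...
split -l 3 access.log access.log.

wc -l xaa xab xac
```

The last chunk (xac) holds only the leftover line, as the line counts show.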

Merge Files into Columns
paste -d 'delimiter' file1 file2 > newfile

(Replace delimiter with the column separation character or code e.g a comma ‘,’ a space ‘ ‘ or a set of characters ‘xxxxx’)

Example:

paste -d ',' column1.txt column2.txt > merged.txt

Would read each line of column1.txt, append a comma to its end, then append the matching line of column2.txt, writing each combined line into merged.txt. The delimiter sits between the two columns; nothing is added to the end of the line.

So if column1.txt contained the lines:

apple

lemon

cherry

and column2.txt contained the lines:

green

yellow

red

then merged.txt would be written:

apple,green

lemon,yellow

cherry,red
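The example above can be recreated end to end in a few lines (the file names come from the example itself):

```shell
# Build the two column files from the example above.
printf '%s\n' apple lemon cherry > column1.txt
printf '%s\n' green yellow red > column2.txt

# Join line N of column1.txt to line N of column2.txt with a comma between them.
paste -d ',' column1.txt column2.txt > merged.txt

cat merged.txt
```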

Shuffle Lines of Data
shuf input.txt > output.txt

Example:

shuf movies.txt > movies_shuffled.txt

The data lines contained within movies.txt would be shuffled and placed into movies_shuffled.txt.

The output file is autocreated. The data in the input file is untouched and remains readable.
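A quick sketch (the movie titles are made up) showing that shuffling only reorders lines — none are added or lost:

```shell
printf '%s\n' Alien Casablanca Jaws Psycho Vertigo > movies.txt

# Randomly reorder the lines into a new file; movies.txt itself is untouched.
shuf movies.txt > movies_shuffled.txt

wc -l movies.txt movies_shuffled.txt
```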

Sort Files
sort input.txt > output.txt

Example:

sort movies-shuffled.txt > movies-sorted.txt

All data lines in movies-shuffled.txt would be sorted into alphanumeric order and put into movies-sorted.txt.

The output file is automatically created. None of the data in the input file is altered.
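A sketch with made-up contents:

```shell
printf '%s\n' cherry apple lemon > movies-shuffled.txt

# Sort the lines alphanumerically into a new file.
sort movies-shuffled.txt > movies-sorted.txt

cat movies-sorted.txt
```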

Data Collation/Data Extraction
grep "criteria" sourcefile.txt > destinationfile.txt

Example:

grep "<b>tooth: </b>" mouth.txt > fairy_pocket.txt

Any line in mouth.txt containing the string (text/numbers) <b>tooth: </b> would be copied into the file fairy_pocket.txt.

The data in the source file is left intact (nothing is removed). The destination file is autocreated. The quotation marks are essential parts of this command.
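Recreating the example with a little made-up sample data — two lines match the pattern, one does not:

```shell
# Sample source file; the contents are illustrative.
printf '%s\n' '<b>tooth: </b>molar' 'gum line' '<b>tooth: </b>incisor' > mouth.txt

# Copy only the matching lines; mouth.txt is left untouched.
grep "<b>tooth: </b>" mouth.txt > fairy_pocket.txt

cat fairy_pocket.txt
```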

Data Replacement
sed 's/StringToReplace/ReplacementString/g' source.txt > destination.txt

Example:

sed 's/house/home/g' mortgage.txt > owned.txt

Every occurrence of “house” in mortgage.txt would be changed to “home” when the data is copied to owned.txt

The data in the source file is left intact (nothing is removed). The destination file is autocreated.

ALL occurrences of the string being replaced will be substituted with the replacement string. In the above example “house” is replaced with “home” whether it occurs as a single entity or as part of another entity: “warehouse” would become “warehome”.
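If that substring behaviour is not what you want, GNU sed understands \b word boundaries, which limit the match to the whole word; a sketch with made-up data:

```shell
printf '%s\n' 'my house' 'the warehouse' > mortgage.txt

# \b matches only at a word boundary (GNU sed), so "warehouse" is left alone.
sed 's/\bhouse\b/home/g' mortgage.txt > owned.txt

cat owned.txt
```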

To delete data from a file and not replace it with anything, use

sed 's/tobedeleted//g' old.txt > new.txt

The “-e” option lets you chain several substitutions into one command, e.g.

sed -e 's/house/home/g' -e 's/flat/apartment/g' mortgage.txt > owned.txt

The forward slashes, “/”, that separate the parts of the command are just delimiters; another character, such as “#”, can stand in for them, which is handy when the text being matched contains slashes of its own, e.g.

sed 's#<b>number</b>#<b>numbers</b>#g' number.txt > numbers.txt

(<b>number</b> would be replaced with <b>numbers</b>)

Delete the Nth character of every line

(Replace # with the number of characters to keep before the one to be removed. The -r option makes GNU sed use extended regular expressions, and the replacement \1\2 pastes the two captured groups back together.)

sed -r 's/^(.{#}).(.*)/\1\2/' sourcefile > outputfile

Example One:

sed -r 's/^(.{0}).(.*)/\1\2/' log.txt > log2.txt

The contents of log.txt would be copied to log2.txt minus the first character of every data line, e.g. abcde would be copied as bcde

Example Two:

sed -r 's/^(.{3}).(.*)/\1\2/' log.txt > log2.txt

The contents of log.txt would be copied to log2.txt minus the fourth character of every data line, e.g. abcde would be copied as abce

The original file’s data remains unedited. The output file is automatically created.

Notice that the first character has the place number of “0”.
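Note that the command needs GNU sed’s -r option and backslashes on the backreferences — \1\2, which web formatting often mangles into “12”. A runnable sketch with made-up data:

```shell
printf '%s\n' abcde vwxyz > log.txt

# Keep the first 3 characters (\1), drop the 4th, keep the rest (\2).
sed -r 's/^(.{3}).(.*)/\1\2/' log.txt > log2.txt

cat log2.txt
```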

Delete the FIRST N characters of every line

(replace # with the number of characters to be removed; the spaces here act as the command’s delimiters in place of slashes, and -r enables extended regular expressions)

sed -r 's .{#} ' source.txt > destination.shtml

Example:

sed -r 's .{10} ' source.txt > destination.shtml

Would copy all the data in source.txt over to destination.shtml minus the first 10 characters of every line of data.

The original file’s data remains unedited. The output file is automatically created.

Similarly:

sed -r 's .{2} ' source.txt > destination.shtml

(replace “2” with the number of characters to be removed)

and

sed 's .. ' source.txt > destination.shtml

(use dots (“.”) to represent each character to be removed).
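Without -r, sed’s default (basic) regular expressions need the repetition braces escaped; a portable sketch with made-up data and file names:

```shell
printf '%s\n' 1234567890 abcdefghij > source.txt

# Basic-regex form: the braces must be escaped as \{3\} when -r is absent.
sed 's/^.\{3\}//' source.txt > destination.txt

cat destination.txt
```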

Delete the LAST N characters of every line

(replace # with the number of characters to be removed)

sed -r 's/.{#}$//' source.txt > destination.shtml

Example:

sed -r 's/.{8}$//' mrfingers.txt > mr2fingers.shtml

Would copy the data from mrfingers.txt to mr2fingers.shtml minus the last 8 characters from every data line e.g 1234567890 would become 12.

The source file remains intact. The destination file is automatically created.
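The $ anchors the match to the end of the line; with -r enabling the brace syntax, a sketch (output file name is illustrative):

```shell
printf '%s\n' 1234567890 > mrfingers.txt

# Remove the 8 characters immediately before the end of each line.
sed -r 's/.{8}$//' mrfingers.txt > mr2fingers.txt

cat mr2fingers.txt
```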

Add text to the END of a line
sed 's/$/text to add/g' source.txt > destination.txt

Example:

sed 's/$/ end of line/g' old_file.txt > new_file.txt

Would copy the data from old_file.txt to new_file.txt and append the text ” end of line” (with the space) to the end of every line of data in new_file.txt

The source file remains intact. The destination file is automatically created.

Add text to the BEGINNING of a line
sed 's/^/text to add/g' source.txt > destination.txt

Example:

sed 's/^/beginning of line /g' old_file.txt > new_file.txt

Would copy the data from old_file.txt to new_file.txt and append the text “beginning of line ” (with the space) to the beginning of every line of data in new_file.txt

The source file remains intact. The destination file is automatically created.
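The two edits can be chained with -e, adding text to both ends of every line in a single pass; a sketch with made-up data:

```shell
printf '%s\n' one two > old_file.txt

# ^ matches the start of each line, $ the end; -e chains the two substitutions.
sed -e 's/^/start: /' -e 's/$/ :end/' old_file.txt > new_file.txt

cat new_file.txt
```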

Delete Duplicate Lines of Data
uniq source.txt > destination.txt

Example:

uniq duplicate.txt > unique.txt

Would read the data in duplicate.txt, strip it of duplicate lines (keeping only one of each) then copy it to unique.txt

The data in the source file must be sorted using the sort command before the uniq command is issued.

Another method, that doesn’t require the use of two commands is

sort -u source.txt > destination.txt

This method automatically sorts the data in the source file, then writes only unique data lines into destination.txt

A third method, a better but harder to remember method, that doesn’t require the data to be sorted is

awk '!x[$0]++' source.txt > destination.txt

For the above commands, the source file remains untouched and the destination file is automatically created.
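Side by side with made-up data — note that sort -u reorders the lines, while the awk one-liner keeps the first occurrence of each line in its original position:

```shell
printf '%s\n' pear apple pear cherry apple > source.txt

# Sorted output, duplicates removed.
sort -u source.txt > sorted_unique.txt

# Duplicates removed, original order kept: x[$0]++ is 0 (false) only the
# first time a line is seen, so !x[$0]++ prints first sightings only.
awk '!x[$0]++' source.txt > first_seen.txt

cat first_seen.txt
```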

Usage Tip

Because of the way the Linux command line works, any of the above commands can be written into a copy-and-paste script, i.e. you can write a list of instructions, one per line, and save it to a file; then, whenever you need to issue those same commands, reopen the file and copy and paste the instruction set into the Linux command line.

Example:

grep "userID" AllUserIDs.txt > SpecificUserID.txt
sort SpecificUserID.txt > OrganisedActionsByUserID.txt
sed 's/UnwantedData//g' OrganisedActionsByUserID.txt > SpecificUserID.txt

Grep would search the file AllUserIDs.txt for any line of data containing the text “userID”. Lines containing “userID” would be copied to a new (autocreated) file called SpecificUserID.txt. This is accomplished by the command grep "userID" AllUserIDs.txt > SpecificUserID.txt

Next, the sort command would read the data in SpecificUserID.txt and sort it alphanumerically. This sorted data would be posted to an autocreated file called OrganisedActionsByUserID.txt .

Finally, sed would read the file OrganisedActionsByUserID.txt , clean it of “UnwantedData” then post it to an autocreated file called SpecificUserID.txt. The original SpecificUserID.txt would be overwritten by the new one created by sed.
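The whole three-step set, runnable end to end with made-up log data (the “[debug] ” tag plays the part of UnwantedData):

```shell
# Made-up source data: two userID lines and one line of noise.
printf '%s\n' 'userID=42 [debug] delete' 'other line' 'userID=7 [debug] login' > AllUserIDs.txt

# Step 1: keep only the lines mentioning userID.
grep "userID" AllUserIDs.txt > SpecificUserID.txt

# Step 2: sort them alphanumerically.
sort SpecificUserID.txt > OrganisedActionsByUserID.txt

# Step 3: strip the unwanted tag, overwriting the step-1 file as described.
sed 's/\[debug\] //g' OrganisedActionsByUserID.txt > SpecificUserID.txt

cat SpecificUserID.txt
```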
