Does anyone have an example Arc script that uses the fileReadLine operation to walk through an input file, parsing the text line by line?
Hi Walter,
The fileReadLine operation works like an enumerator: it returns one result for every line in the file, using the newline character as the separator.
Passing in the following file:
red
orange
yellow
green
blue
purple
Through this code in a Script connector:
<arc:set attr="file.file" value="[FilePath]" />
<arc:call op="fileReadLine" item="file">
<!-- the file:data attr in the output item has the line -->
<arc:set attr="output.filename" value="[file.file:data]" />
<arc:push item="output" />
</arc:call>
This would output 6 files, one for each line in the document.
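For readers who think more easily in a general-purpose language, the enumerator behaviour described above can be approximated in Python. This is only an analogy, not how the connector is implemented; `read_lines` is a hypothetical helper name, and UTF-8 encoding is an assumption.

```python
def read_lines(path):
    """Yield one result per line of the file, split on the newline character."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Drop the separator so each result holds only the line's text.
            yield line.rstrip("\n")
```

Each value yielded here plays the role of `file.file:data` inside the `arc:call` body, so any per-line work goes inside the loop that consumes the generator.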
Here’s a Script I just wrote which removes duplicate lines from any text file.
Someone may find it useful.
I create a hash of the entire line and store each line in an attribute using the hash as the attribute name. If the line appears again it will overwrite the previous copy (but being the same there is no material effect).
I then concatenate all the attributes together using a newline.
NOTE: The output file will be ordered differently from the input file, as the ordering is based on the attribute name.
<!-- NOTE: Do not edit arc:info -->
<arc:info title="Custom Script" desc="This script will be executed when a file is processed.">
<input name="ConnectorId" desc="The id of this connector." />
<input name="WorkspaceId" desc="The workspace of this connector." />
<input name="MessageId" desc="The message id." />
<input name="FilePath" desc="The path of the file being processed." />
<input name="FileName" desc="The name of the file being processed." />
<input name="Attachment#" desc="The path of the attachment being processed." />
<input name="Header:*" desc="The message headers of the file being processed." />
<output name="Data" desc="The data that will be written to the file in the Receive folder." />
<output name="Encoding" desc="The encoding that will be used to write the file in the Receive folder." />
<output name="FileName" desc="The name of the output file in Receive folder." />
<output name="FilePath" desc="The path of file that will be written to Receive folder." />
<output name="Header:*" desc="The message headers of the file being written." />
</arc:info>
<!--
<arc:set attr="_log.info">
Incoming file = [filepath]
</arc:set>
-->
<arc:set attr="file.file" value="[FilePath]" />
<arc:set item="newrows"/>
<arc:call op="fileReadLine" item="file">
<!--<arc:set attr="_log.info">
[file.file:data|md5hash(false)] = [file.file:data]
</arc:set>
-->
<!--<arc:set item="newrows" attr="[file.file:data|md5hash(false)]" value="[file.file:data|md5hash(false)]\|[file.file:data]" />-->
<arc:set item="newrows" attr="[file.file:data|md5hash(false)]" value="[file.file:data]" />
</arc:call>
<arc:set attr="output.data" value=""/>
<arc:enum item="newrows" attr="*">
<arc:first>
<arc:set attr="output.data" value="[_value]"/>
</arc:first>
<arc:set attr="output.data" value="[output.data]\n[_value]"/>
</arc:enum>
<arc:set attr="output.filepath" value="[filepath]" />
<arc:push item="output" />
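The idea behind the script above (key each line by its hash so a repeat overwrites its identical earlier copy, then join what is left with newlines) can be sketched in Python as a language-neutral illustration. This is not ArcScript and not the connector's implementation; `dedupe_by_hash` is a hypothetical name, the hex MD5 digest is an assumption about what `md5hash(false)` returns, and sorting by key is an assumption based on the NOTE above about ordering by attribute name.

```python
import hashlib


def dedupe_by_hash(lines):
    """Remove duplicate lines by storing each one under its MD5 hash."""
    rows = {}
    for line in lines:
        key = hashlib.md5(line.encode("utf-8")).hexdigest()
        # A repeated line lands on the same key and overwrites an identical value.
        rows[key] = line
    # Emit in key order (assumed), so the input order is not preserved.
    return "\n".join(rows[key] for key in sorted(rows))
```

Because the output order depends on the hash values rather than on where a line first appeared, the result contains each distinct line once but generally not in input order.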
That’s a neat example - do you mind if we repurpose this script into its own topic (We’ll credit the example to you)? I’ve made some minor changes to the script to minimize the code (note that you can stick an enum inside of a set to continue the call).
<arc:set attr="file.file" value="[FilePath]" />
<arc:call op="fileReadLine" item="file">
<!-- enumerate the file and add the rows to a collection -->
<arc:set item="newrows" attr="[file.file:data|md5hash(false)]" value="[file.file:data]" />
</arc:call>
<!-- repopulate the row data -->
<arc:set attr="output.data">
<arc:enum item="newrows">[_value]\n</arc:enum>
</arc:set>
<arc:push item="output" />
There is a slight difference in our code (there’s a trailing \n in my edit), and note that the order of the lines isn’t preserved here. Using this, I was able to pass in this file:
Apple
Banana
Cherry
Banana
Date
Eggplant
And output this content:
Date
Cherry
Apple
Eggplant
Banana
Hi James, yep feel free to share.
I have a new version which should retain the line order:
I use a single attribute now to maintain a duplicate lookup (which might struggle on large files).
I use a separate item to store the non-duplicates, using [_index] as the attribute name.
I haven’t reworked it yet to use your new version, James.
<arc:set attr="file.file" value="[FilePath]" />
<arc:set item="newrows"/>
<arc:set item="output" attr="dupetracker" value="~"/>
<arc:call op="fileReadLine" item="file">
<arc:set attr="thisline" value="[file.file:data|trim]"/>
<arc:set attr="thislinehash" value="[thisline|md5hash(false)]"/>
<arc:if exp="[output.dupetracker | contains([thislinehash])]">
<!-- do nothing if we have already seen this line -->
<arc:else>
<arc:set attr="output.dupetracker" value="[output.dupetracker]~[thislinehash]" />
<arc:set item="newrows" attr="[_index]" value="[thisline]" />
</arc:else>
</arc:if>
</arc:call>
<arc:set attr="output.data" value=""/>
<arc:enum item="newrows" attr="*">
<arc:set attr="output.data" value="[output.data]\n[_value]"/>
</arc:enum>
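The order-preserving version can likewise be sketched in Python, purely as an illustration of the logic rather than as ArcScript. Here `dedupe_keep_order` is a hypothetical name, and a `set` stands in for the `~`-delimited `dupetracker` string; that swap is my assumption, not part of the script above.

```python
import hashlib


def dedupe_keep_order(lines):
    """Drop repeated lines while keeping the first occurrence in input order."""
    seen = set()   # hashes of lines already emitted (stands in for dupetracker)
    kept = []      # non-duplicate lines, in the order they were first seen
    for raw in lines:
        line = raw.strip()
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # do nothing if we have already seen this line
        seen.add(digest)
        kept.append(line)
    return "\n".join(kept)
```

A set lookup does not slow down as the number of distinct lines grows, whereas searching one ever-longer string does, which is likely the cost behind the large-file concern mentioned above.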